ABSTRACT


This  study  investigates  load  sharing  in  a system  of  computers 
interconnected  by  a store  and  forward  communication  network.  The 
problem  is  analyzed  by  modeling  both  computers  and  communication 
channels  as  queues  and  evaluating  system  performance  on  the  basis  of 
the  steady  state  expected  time  to  process  computer  jobs  in  the  system. 
Upper and lower bounds on this performance criterion are developed and
used  to  define  regions  of  operation  for  a network  using  load  sharing. 

Two  techniques  for  load  sharing  are  then  presented. 

The  first  technique,  called  statistical  load  sharing,  consists  of 
sending  a fraction  of  the  jobs  arriving  at  overloaded  computers  to 
underloaded  computers  by  random  sampling.  This  technique  is  analyzed 
by  a network  of  queues  model.  It  is  shown  that  the  general  formulation 
of  statistical  load  sharing  is  a nonlinear  multicommodity  flow  problem 
which  can  be  solved  by  an  efficient  computer  algorithm.  The  improvement 
in  system  reliability  due  to  the  ability  of  load  sharing  to  provide 
emergency  backup  in  case  of  computer  failure  is  also  studied. 

The  second  technique  for  load  sharing,  a type  of  dynamic  load 
sharing,  makes  job  assignments  to  computers  on  the  basis  of  the  computers 
not  busy  at  the  time  of  assignment.  This  technique  is  analyzed  by  an 
approximation  to  the  hypercube  queueing  model. 

CHAPTER I INTRODUCTION


1.1  Description  of  the  Problem 

Computer-communication  networks  are  a major  area  of  technological 
development  today.  The  main  reason  for  the  current  interest  in  computer- 
communication  networks  is  that  such  networks  are  able  to  provide  system 
capabilities  that  far  surpass  the  capabilities  of  a single  isolated 
computer.  Some  of  the  system  capabilities  that  combined  communication 
and  computer  systems  can  provide  are: 

1)  Remote,  interactive  access  to  time-shared  facilities. 

2)  Sharing  of  computer  data  bases. 

3)  Sharing  of  specialized  computer  resources. 

4) Load sharing among computers.

5)  Emergency  backup  in  case  of  computer  failure. 

[Ref. 34]

The  purpose  of  this  study  is  to  quantify  some  of  the  load  sharing  and 
emergency  backup  benefits  that  a computer-communication  network  can 
provide. 

The  specific  problem  studied  is  load  sharing  in  a system  of  computers 
interconnected  by  a store  and  forward  communication  network.  The  problem 
is  analyzed  by  modeling  both  computers  and  communication  channels  as 
queues.  This  model  is  of  interest  because  it  is  mathematically  tractable. 
It allows one to evaluate the performance of a system of computers in
terms of the steady state expected time to process a computer job,
which is the performance measure used in this study. The time to process
a computer job includes the computation time plus any communication
time the job may require if it is processed at a computer other than the
one at which it was submitted. This concept of load sharing is illustrated
in Figure 1.1.

This  study  shows  that  one  of  the  possible  benefits  of  load  sharing 
is  a lower  expected  time  to  process  jobs  in  the  system.  A second  benefit 
of  load  sharing  is  increased  system  reliability  due  to  the  ability  to 
provide  emergency  backup.  A more  detailed  summary  of  these  results  is 
given  after  the  following  brief  discussion  of  the  use  of  store  and  for- 
ward communication  networks  to  interconnect  computers  and  previous 
studies  of  load  sharing. 

1.2  Background 

This  study  considers  store  and  forward  (message  switched)  communi- 
cation networks  since  the  queueing  model  used  to  represent  such  networks 
makes  the  load  sharing  problem  mathematically  tractable.  Moreover,  much 
recent  progress  in  the  area  of  resource  sharing  among  computers  using 
packet  switched  communications,  a form  of  message  switching  in  which  long 
messages  are  divided  and  sent  as  several  packets,  has  been  made  by  the 
Advanced Research Projects Agency (ARPA) Network. The ARPA Network is
a distributed packet switched system which ties together many of the
major research computer facilities in the United States. The goal of
the network is to make every local computer resource, both hardware and
software, available to any user in the network without degradation. In
attempting to meet this goal economically, it was found that a distributed
packet switched network was an attractive design choice, as has been
discussed by Kahn [Ref. 15] and Roberts and Wessler [Ref. 30].

[Figure 1.1: figure not reproduced; surviving labels: Computer A, Queue]

One  of  the  reasons  that  packet  switched  communication  networks  are 
appropriate  for  computer  communication  is  that  computer  traffic  tends 
to be bursty in nature. [Refs. 7 and 13] Packet switching allows one
to make good use of communication facilities when the traffic handled is
of this type. This is because in a packet switched system, there is no
need to switch communication circuits between source and destination
before and after each burst of traffic. Instead, the communication
circuits in the network can be shared by messages with different
destinations without incurring circuit switching delays which tie up the
circuits while not allowing data to be transmitted over them. [Ref. 28]

The  ARPA  Network  experience  has  brought  about  the  serious  consider- 
ation of  packet  switched  communication  networks  as  the  design  choice  for 
future  computer-communication  networks.  This  supports  the  study  of  load 
sharing  using  a store  and  forward  communication  network. 

Load  sharing  in  a network  of  computers  has  previously  been  studied 
by Bowdon [Refs. 1 and 2]. The study considered a network of computer
centers in which jobs of different priority classes were processed. The
computers within each center were modeled as queues with finite length
buffers, making the system a system with loss. For this network of
computer centers, a load sharing algorithm to improve network throughput
was proposed. The dispatching algorithm was to balance the load
in the network so that, for each priority class, the expected waiting
times at all computers would be equal. A quantitative measure of the
improvement in system performance achieved by using the load sharing
algorithm was not given.


Roome and Torng [Ref. 31] have studied a type of dynamic load
sharing in a computer-communication network where jobs are assigned
to computers for processing on the basis of the expected time to process
them at the various computers in the system. They have shown by
way of simulation that improvements in expected job time in a distributed
computer system can be achieved by this technique.

Another  study  of  load  sharing  in  a computer  network  has  recently 
been done by McGregor and Boorstyn [Ref. 27]. Their study developed a
model for load sharing operation in which both computers and communication
channels were modeled as queues. Computer jobs were dispatched to various
computers by random sampling and a modified gradient algorithm was used
to find the load sharing policy which minimized the expected job time
in the network. This problem formulation is identical to what is called
statistical load sharing in this report. The McGregor and Boorstyn study
was done prior to and independently of this report. There is substantial
overlap in the two studies and where such overlap occurs, the results
agree. This report studies some aspects of
load sharing in greater detail than the previous work and shows how one
can  apply  an  existing  efficient  algorithm  for  solving  multicommodity 
flow  problems  directly  to  the  load  sharing  problem. 

McGregor  also  studied  the  problem  of  how  to  design  optimum  computer- 
communication  network  topologies  in  which  load  sharing  was  to  be  used 
[Ref. 26]. Heuristic algorithms were presented for the design of tree
topologies which minimize the weighted sum of network cost and expected
job time, and the design of connected topologies which maximize throughput
subject to a network cost constraint and a maximum expected job
time constraint.

1.3  Summary  of  Results 

The queueing models used to represent computers and store and forward
communication channels, along with the validity of the modeling
assumptions, are discussed in Chapter 2. Upper and lower bounds on system
performance in terms of steady state expected job time are then
developed as follows. Upper bounds are developed by considering systems
with an infinite capacity communication network and an instantaneous
global controller. The performance of a system of computers without
intercommunication is used as a lower bound on performance. The upper
and lower bounds are then used to define two regions of load sharing
operation. The first region represents an improvement in expected job
time due to the correction of average load imbalances in the system.
Operation in this region can be achieved by a technique that is called
statistical load sharing, which is investigated in Chapter 3. The
second region of improvement is due to the benefits of using a large
system, rather than many small individual systems, when job assignments
are made on the basis of the system state at the time of the assignment.
Operation in this region is called dynamic load sharing, a limited
technique for which is investigated in Chapter 4.

Statistical load sharing improves system performance by sending
a fraction of the jobs that arrive at overloaded computers to underloaded
computers by random sampling. The analysis of this load sharing
technique in Chapter 3 starts by considering a number of specific examples
to show some of its main operating characteristics. It is shown
that the correction of load imbalances by statistical load sharing can
significantly improve the expected job time in the system at high loads.
Most importantly, load sharing using an adequate communication network
can increase the maximum possible system throughput with a load imbalance
in the system. After considering the specific examples, it is shown
that the general formulation of the statistical load sharing problem is
a nonlinear multicommodity flow problem that can be solved by an efficient
optimization algorithm. The algorithm has been implemented [Ref. 6] and
examples are given of its use.

The  final  topic  investigated  in  Chapter  3 is  load  sharing  operation 
with failure in the system. It is shown that load sharing can increase
system reliability by making the system fail soft, i.e., if one computer
in the system fails, the system can continue to operate at reduced capacity.
Operation at this reduced capacity, however, can increase the
expected job time considerably and this degradation must be accounted for
in system design. It is also shown that the failure of a communication
link in a load sharing system can increase the expected job time significantly.

Statistical  load  sharing  can  improve  system  performance  only  by 
balancing  average  loads.  It  is  possible  to  achieve  performance  gains 
beyond  those  attainable  by  such  load  sharing  by  making  job  assignments 
to  computers  dynamically  on  the  basis  of  which  computers  are  available 
at the time of assignment,* rather than by random sampling. Chapter 4
presents a way of doing this by operating the system using a global
controller that assigns jobs on a first-come-first-served basis. Jobs are
assigned to the computer at which they were submitted, if it is not
busy, or to the first available computer in the system according to a
preference list, if the computer of origin is busy. This load sharing
technique is analyzed by an approximation to the hypercube queueing
model which represents such operation. It is shown that using a com-

* The dynamic load sharing technique studied here differs from that
studied by Roome and Torng.

CHAPTER II MATHEMATICAL MODELS FOR COMPUTER-COMMUNICATION
NETWORKS AND BOUNDS ON SYSTEM PERFORMANCE


2.1  Model  of  a Computer 

The  model  of  a computer  used  in  this  study  is  the  simplest  model 
of  a computer  operating  in  a batch  processing  mode.  This  model  is  the 
single  server  queue  with  a Poisson  input  stream  of  jobs  and  a negative 
exponential  service  time  distribution  (M/M/1  queue)  [Ref.  20].  The 
specific  assumptions  that  are  made  when  using  this  model  are: 

1. The input stream of jobs is a Poisson process with mean
arrival rate $\lambda$.

2. The number of operations required per job is distributed
as a negative exponential with mean $1/\ell$.

3. The computer performs R operations per unit time. This
means that the service time per job (not including waiting
time) is distributed as a negative exponential with
mean $1/\ell R$.

4. Jobs are processed in a first-come-first-served manner.
If the computer is busy when a job arrives, it is queued
in an infinite buffer.

The  validity  of  these  assumptions  depends  of  course  on  the  specific 
system  being  studied,  there  being  a wide  range  of  computing  systems  in 
use today. For example, the inputs to the computer could be batch programs
read in through a card reader, inputs from an interactive terminal
or inputs from a remote sensor. The assumption of a Poisson input
stream may or may not hold for the system under consideration. For the
case of inputs from a teletypewriter-like terminal, Fuchs and Jackson
[Ref. 7] have shown that the interarrival time between user inputs often
fits a gamma distribution which can sometimes be approximated by an
exponential distribution as required for the input stream to be Poisson.
The assumption of an exponential service time distribution is an
important one to examine. It is important because the performance
evaluations made in this study are based on an expected job time criterion,
and the service time distribution of a queue has a direct influence on
this parameter. The well known Pollaczek-Khintchine formula gives the
expected number of customers in a single server queueing system with
Poisson input and general service time distribution as


$$L = \rho + \frac{\rho^2 + \lambda^2 \sigma_s^2}{2(1 - \rho)}$$

where $\rho = \lambda / \ell R$ and $\sigma_s^2$ is the variance of the service time distribution. [Ref. 8]

By applying Little's formula [Ref. 24]

$$L = \lambda W$$

it follows directly that the expected time to pass through the system,
W, depends on the mean and variance of the service time distribution.
In particular, if the variance of the actual service time distribution
in a system is greater than that of the exponential distribution, then
the M/M/1 queueing model will give an expected job time which is less
than the actual expected job time. There is reason to believe that the
service time distributions of computers are sometimes high variance
distributions. Figure 2.1 shows an example of such service time (CPU
time) statistics. While the statistics shown have the general shape
of an exponential distribution, they have a very long tail which gives
them a high variance.
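
The effect of service time variance on expected job time can be checked numerically. The following is a minimal sketch, not part of the original study: it evaluates the Pollaczek-Khintchine expression and then applies Little's formula. The function names and the sample values (arrival rate 0.5, mean service time 1, variance 4) are illustrative assumptions.

```python
def pk_expected_time(lam, mean_service, var_service):
    """Expected time in a single server queue with Poisson input.

    lam          : mean arrival rate (assumed > 0)
    mean_service : mean service time (1 / (l R) in the text's notation)
    var_service  : variance of the service time distribution
    """
    rho = lam * mean_service
    if lam <= 0 or rho >= 1:
        raise ValueError("a steady state requires lam > 0 and rho < 1")
    # Pollaczek-Khintchine: expected number of customers in the system.
    expected_in_system = rho + (rho ** 2 + lam ** 2 * var_service) / (2.0 * (1.0 - rho))
    # Little's formula: W = L / lam.
    return expected_in_system / lam


def mm1_expected_time(lam, mean_service):
    """M/M/1 special case: exponential service has variance mean_service**2."""
    return pk_expected_time(lam, mean_service, mean_service ** 2)


if __name__ == "__main__":
    # Illustrative values only: same mean service time, different variance.
    print("M/M/1 model:        ", mm1_expected_time(0.5, 1.0))
    print("high variance model:", pk_expected_time(0.5, 1.0, 4.0))
```

With the same mean service time, the higher variance case gives the larger expected time, which is the sense in which the M/M/1 model can underestimate actual job time.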

There  are,  however,  also  studies  in  which  the  exponential  service 
time  assumption  gave  results  that  corresponded  closely  to  actual  sys- 
tem statistics.  An  example  of  this  is  the  study  of  the  Michigan  Terminal 
System  by  Moore.  [Ref.  29] 

The  assumption  of  an  infinite  buffer  makes  the  system  a no  loss 
system.  This  assumption  is  realistic  if  the  system  has  a buffer  of 
such  size  that  overload  occurs  with  extremely  small  probability. 

2.2  Model  of  a Store  and  Forward  Communication  Channel 

The  model  used  for  a store  and  forward  communication  channel  is 
also  an  M/M/1  queue.  As  such,  basically  the  same  type  of  assumptions 
are  made  as  for  the  model  of  a computer.  The  discussions  about  the 
Poisson  input  and  infinite  buffer  assumptions  carry  over  almost  directly. 
The service time assumptions need to be examined separately, however. For
the  communication  channel  it  is  assumed  that 

1. The length of messages in bits is distributed as a negative
exponential with mean $1/\mu_p$ for programs (computer inputs)
and mean $1/\mu_r$ for results (computer outputs).

2. The communication channel has a capacity of C bits per
unit time so that the time to transmit a message is
also distributed as a negative exponential with mean
$1/\mu_p C$ or $1/\mu_r C$.

3. If a message passes through more than one communication
channel, its length is chosen independently
at each channel through which it passes.

[Figure 2.1 Sample Distribution of CPU Time per Computer Job. Axes: Percentage of Total Number of Jobs; CPU Minutes. Source of data: M.I.T. Information Processing Center, Job Processing System Income Distribution Report, March 1975.]

The  queueing  model  of  a store  and  forward  communication  channel 
has  been  extensively  applied  to  the  analysis  of  the  ARPA  Network  by 
Kleinrock  [Refs.  18  and  19].  In  these  studies  the  analytic  queueing 
model  with  its  exponential  service  time  and  independence  assumptions 
gave  an  accurate  representation  of  the  basic  performance  characteristics 
of  the  network.  However,  in  order  to  match  the  analytic  results  more 
closely  to  simulation  results  of  network  operation,  it  was  found  that 
the  expression  for  average  delay  through  the  network  needed  to  be 
modified  to  include  delays  other  than  those  due  to  the  finite  time 
required  to  transmit  a message  over  a finite  bandwidth  channel  and 
those due to the resulting queueing. Delay terms to account for
acknowledgement traffic, propagation delays and message processing delays
were added to the analytic model. In this study those delay terms will
be assumed to be zero. The result is that the system being analyzed is
an idealized system in which communication delays result only from the
limited capacities of communication channels and the associated queueing
delays.

The independence assumption for messages that pass through more
than one communication channel is necessary to remove the statistical
dependence between the interarrival times and message lengths of adjacent
messages in the network. With this dependence, an analytic solution
to the queueing network problem does not exist. This independence
assumption has been studied in detail by Kleinrock [Ref. 17].

Another  way  in  which  the  M/M/1  queue  model  is  an  idealization  of 
actual  implementations  of  message  switched  networks  is  that  the  model 
assumes  that  each  message  is  transmitted  as  one  block  of  data.  In 
actual  systems,  long  messages  are  often  divided  into  packets,  each  of 
which  can  have  its  own  routing  through  the  network.  The  queueing 
theory  for  message  delay  when  messages  are  divided  into  packets  has 
been  studied  by  Rubin  [Ref.  32].  This  study  of  load  sharing  assumes 
that  messages  are  transmitted  as  one  unit. 

In modeling load sharing operation, it is important to examine the
relationship between $1/\mu_p$ and $1/\mu_r$, the mean lengths of computer programs
and computer results, respectively. There is evidence that the mean
length of computer results is often an order of magnitude greater than
the mean length of input programs. [Ref. 13] It is, however, also
quite possible to think of systems where the input data and the output
data are more nearly equal. An example is a control computer used to
process many inputs to produce a single decision. In light of this,
both the case of a ten to one and the case of a one to one ratio of
output to input will be investigated in the analysis that follows.

One final point that needs to be made is that the distribution of
the  message  lengths  of  programs,  the  number  of  operations  they  require 
and  the  message  length  of  results  are  assumed  to  be  independent.  While 
there  is  no  physical  basis  for  this  assumption,  it  is  required  to  make 
the  problem  mathematically  manageable. 

2.3  System  Performance  Measure 

The  measure  used  to  evaluate  the  performance  of  a computer  system 
in  this  study  is  the  steady  state  expected  time  to  process  a computer 
job  submitted  to  the  system.  The  expression  for  this  system  expected 
job time is

$$
\begin{aligned}
E[T] &= \int_{x=0}^{\infty} x \, p_T(x) \, dx \\
     &= \sum_{i=1}^{N} \int_{x=0}^{\infty} x \, p_T(x \mid \text{enter at } i) \, P(\text{enter at } i) \, dx \\
     &= \sum_{i=1}^{N} P(\text{enter at } i) \int_{x=0}^{\infty} x \, p_T(x \mid \text{enter at } i) \, dx \\
     &= \sum_{i=1}^{N} \frac{\lambda_i}{\lambda_T} \, E[T_i] \qquad (2.1)
\end{aligned}
$$

The time $E[T_i]$ is the expected time to process a computer job
which enters the system at the i th computer. This expected time includes
the computation time required by the job and, if the job is processed
by a computer other than the one to which it was submitted, it
also includes the communication time to send the program to the processing
computer and the time to return the results to the point of origin.
The times $E[T_i]$ are weighted by the probability that an incoming job
is submitted to the i th computer, which is the mean rate of jobs submitted
to the i th computer ($\lambda_i$) divided by the total input rate to
the system ($\lambda_T$). These terms are summed over all N computers in the
system to give a system expected job time.

The  next  two  sections  examine  upper  and  lower  bounds  of  system 
performance  based  on  expected  job  time. 


2.4  Lower  Bounds  on  System  Performance 


A lower bound on the performance of a system of computers using
load sharing is their performance without any load sharing. With
each computer modeled as an M/M/1 queue with mean service time $1/\ell R_i$,
the expression for $E[T_i]$ is (cf. Appendix A)

$$E[T_i] = \frac{1}{\ell R_i - \lambda_i}, \qquad 0 \le \lambda_i < \ell R_i$$

Substituting into Equation 2.1 gives

$$E[T] = \frac{1}{\lambda_T} \sum_{i=1}^{N} \frac{\lambda_i}{\ell R_i - \lambda_i} \qquad (2.2)$$


Figure 2.2 shows several plots of Equation 2.2 for a system of three
equal capacity computers. The three curves shown in this graph are
lower bounds on performance for three different load balances in the
system. They illustrate the two basic characteristics of this lower
bound function, which are 1) that the function has a pole at some
$\lambda_T$ in the range $0 < \lambda_T \le \sum_{i=1}^{N} \ell R_i$, where N is the number of computers in the system,
and 2) that the more evenly the load is distributed in the system, the
better the lower bound on performance.

Note that a lower bound on system performance is an upper bound on
expected job time and that an upper bound on system performance is a
lower bound on expected job time. Whenever the terms upper and lower
bound are used in this study, they refer to system performance.

[Figure 2.2: figure not reproduced; surviving label: Best Lower Bound]

The pole in the lower bound function occurs when one of the
$\lambda_i = \ell R_i$. This is because for a steady state to exist at each of the
computer queues, each $\lambda_i$ must be less than $\ell R_i$. If $\lambda_i \ge \ell R_i$ then the
queue at computer i becomes infinite and so does the waiting time,
causing the system expected job time to be infinite as well. Figure 2.2
shows how a load imbalance can thus severely degrade system performance.
In the case of a load imbalance where the ratio $\lambda_1 : \lambda_2 : \lambda_3$ is ..., the
system pole occurs when the largest $\lambda_i = 1$. The total system load at this point is
only $\lambda_T = 1.4$. As will be analyzed in the next chapter, the correction
of such degradation of system performance due to imbalanced loads is
one of the main benefits of using load sharing in a computer-communication
network.
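
The behavior of the lower bound on performance under load imbalance can be illustrated numerically. The sketch below evaluates Equation 2.2 for three equal capacity computers with 1/lR taken as one unit of time. The function name and the two load splits are illustrative assumptions and are not the cases plotted in Figure 2.2.

```python
import math


def no_sharing_job_time(arrival_rates, service_rates):
    """System expected job time without load sharing (Equation 2.2).

    arrival_rates : lambda_i for each computer
    service_rates : l * R_i for each computer
    Returns math.inf once any computer is at or beyond its pole.
    """
    total_rate = sum(arrival_rates)
    if total_rate <= 0:
        raise ValueError("total arrival rate must be positive")
    weighted_sum = 0.0
    for lam, rate in zip(arrival_rates, service_rates):
        if lam >= rate:
            return math.inf  # no steady state exists at this computer
        weighted_sum += lam / (rate - lam)
    return weighted_sum / total_rate


if __name__ == "__main__":
    rates = [1.0, 1.0, 1.0]  # equal capacity, l R = 1
    # Same total load of 1.2, split evenly and unevenly (hypothetical splits).
    print("balanced:  ", no_sharing_job_time([0.4, 0.4, 0.4], rates))
    print("imbalanced:", no_sharing_job_time([0.8, 0.2, 0.2], rates))
    print("at a pole: ", no_sharing_job_time([1.0, 0.2, 0.2], rates))
```

For the same total load, the uneven split gives the larger expected job time, and the result becomes infinite as soon as a single computer reaches its own capacity.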

For the case of equal capacity computers, the best lower bound on
performance is achieved when the computers are equally loaded. In
general, the best lower bound can be found by minimizing the expression
for expected job time (Equation 2.2) with respect to the $\lambda_i$ subject to the
constraint $\sum_{i=1}^{N} \lambda_i = \lambda_T$ by using Lagrange multipliers.

In Figure 2.2, $1/\ell R$ is taken to be one unit of time. This convention
will be followed throughout this study whenever all computers in the
system have the same processing rate.
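
As a sketch of how the Lagrange multiplier step might be carried out (the closed form here is worked out for illustration and is not quoted from the study): setting the derivative of each term of Equation 2.2 equal to a common multiplier gives $\ell R_i / (\ell R_i - \lambda_i)^2 = \gamma$, so the spare capacity $\ell R_i - \lambda_i$ is proportional to $\sqrt{\ell R_i}$. The function name below is hypothetical, and the result is valid only when every resulting $\lambda_i$ is nonnegative.

```python
import math


def best_lower_bound_split(total_rate, service_rates):
    """Loads lambda_i minimizing Equation 2.2 subject to sum(lambda_i) = total_rate.

    Assumes the interior solution applies (all lambda_i >= 0); otherwise the
    nonnegativity constraints bind and this closed form is not valid.
    """
    spare_capacity = sum(service_rates) - total_rate
    if total_rate <= 0 or spare_capacity <= 0:
        raise ValueError("need 0 < total_rate < sum(service_rates)")
    roots = [math.sqrt(rate) for rate in service_rates]
    # Spare capacity is shared out in proportion to sqrt(l R_i).
    scale = spare_capacity / sum(roots)
    loads = [rate - root * scale for rate, root in zip(service_rates, roots)]
    if min(loads) < 0:
        raise ValueError("interior solution does not apply at this load")
    return loads


if __name__ == "__main__":
    print(best_lower_bound_split(1.2, [1.0, 1.0, 1.0]))  # equal capacity case
    print(best_lower_bound_split(3.0, [4.0, 1.0]))       # unequal capacity case
```

For equal capacity computers this reduces to equal loading, in agreement with the statement above.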

2.5  Upper  Bounds  on  System  Performance 

Upper bounds on system performance for a system of computers can
be obtained by using bounding models which represent the best possible
use of the computing resources available. With each computer resource
modeled as a single server with an exponential service time distribution
of mean $1/\ell R_i$, there are two possible upper bound models, depending on
the type of system operation one allows. The first bounding model is
a multiserver queue with N servers, each with an exponential service
time distribution of mean $1/\ell R_i$. The second bounding model is a single
server queue with an exponential service time distribution of mean
$1 / \sum_{i=1}^{N} \ell R_i$. These two upper bound models, along with the lower bound
model, are shown schematically in Figure 2.3.

The multiserver queue bounding model assumes that all jobs arriving
in the system are served in a first-come-first-served manner. Their
service starts instantaneously using the largest capacity computer
available in the system. Each job is processed by only one computer
at a time, but whenever a job leaves the system, the remaining jobs are
reassigned so that the largest capacity computers are always the ones
being used. If all computers are busy, jobs are queued up in order
and the service of each job in turn starts as soon as a computer becomes
available for it. This model is an upper bound model in that it assumes
there exists an infinite capacity (zero delay) communication network
for sending programs and results from one computer to another if a job
is processed by a computer other than the one to which it was submitted.
It also assumes that there is a global controller in the system which
instantaneously makes job assignments.

[Figure 2.3 Bounding Models. Lower Bound Model: N Independent M/M/1 Queues. Upper Bound Model Assuming No Parallel Processing: M/M/N Queue. Upper Bound Model Assuming Parallel Processing: M/M/1 Queue With Mean Service Time $1/N\ell R$.]

The  single  server  upper  bound  model  also  assumes  an  infinite 
capacity  communication  network  and  an  instantaneous  global  controller. 
The  difference  in  operation  between  it  and  the  multiserver  bounding 
model  is  that  the  single  server  runs  only  one  job  at  a time.  In  order 
for  a distributed  computer  system  to  operate  like  this  single  server, 
each  computer  job  that  enters  the  system  must  be  divided  into  N parts 
which are processed in parallel using all N computers in the system at
the  same  time.  In  this  way  each  job  would  be  run  at  a rate  defined  by 
the  combined  capacity  of  the  computers  in  the  system. 

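The expected time to pass through either of the upper bound models
is the same performance measure as system expected job time. If all
computers have the same capacity, the expected time to pass through the
multiserver queue is given by

$$E[T] = \frac{P_0 \, (\lambda_T / \ell R)^N \, (\lambda_T / N \ell R)}{N! \, (1 - \lambda_T / N \ell R)^2 \, \lambda_T} + \frac{1}{\ell R}$$

where

$$P_0 = \left[ \sum_{n=0}^{N-1} \frac{(\lambda_T / \ell R)^n}{n!} + \frac{(\lambda_T / \ell R)^N}{N! \, (1 - \lambda_T / N \ell R)} \right]^{-1}, \qquad 0 \le \lambda_T < N \ell R$$

as given in Appendix A. The expected time to pass through the single
server queue is

$$E[T] = \frac{1}{\sum_{i=1}^{N} \ell R_i - \lambda_T}, \qquad 0 \le \lambda_T < \sum_{i=1}^{N} \ell R_i$$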
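
The two upper bound models can be compared numerically for equal capacity computers. The following is a minimal sketch using the standard M/M/N and M/M/1 expressions given above. The function names and the sample operating point (three computers, lR = 1, total arrival rate 1.5) are illustrative assumptions.

```python
import math


def multiserver_bound(total_rate, service_rate, n_servers):
    """Expected job time for the M/M/N bounding model (equal capacity servers)."""
    offered = total_rate / service_rate   # lambda_T / (l R)
    rho = offered / n_servers             # lambda_T / (N l R)
    if not 0 < rho < 1:
        raise ValueError("need 0 < total_rate < n_servers * service_rate")
    n_fact = math.factorial(n_servers)
    # Probability that the system is empty.
    p0 = 1.0 / (
        sum(offered ** n / math.factorial(n) for n in range(n_servers))
        + offered ** n_servers / (n_fact * (1.0 - rho))
    )
    waiting = p0 * offered ** n_servers * rho / (n_fact * (1.0 - rho) ** 2 * total_rate)
    return waiting + 1.0 / service_rate


def single_server_bound(total_rate, service_rate, n_servers):
    """Expected job time for the pooled M/M/1 bounding model (parallel processing)."""
    if not 0 <= total_rate < n_servers * service_rate:
        raise ValueError("need 0 <= total_rate < n_servers * service_rate")
    return 1.0 / (n_servers * service_rate - total_rate)


if __name__ == "__main__":
    # Hypothetical operating point: N = 3, l R = 1, lambda_T = 1.5.
    print("no sharing, balanced:", 1.0 / (1.0 - 0.5))
    print("M/M/3 bound:         ", multiserver_bound(1.5, 1.0, 3))
    print("pooled M/M/1 bound:  ", single_server_bound(1.5, 1.0, 3))
```

At this operating point the pooled single server gives the smallest expected job time, the multiserver queue the next smallest, and the balanced system without load sharing the largest.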


If  one  is  interested  in  a network  of  computers  of  unequal  capacity, 
the  expected  job  tine  for  the  multiserver  queue  bounding  model  can  be 
derived  as  follows.  Since  the  arrival  stream  of  jobs  is  Poisson,  all 
job  lengths  are  exponentially  distributed  and  jobs  are  always  processed 
on  the  largest  capacity  computer  available,  the  system  can  be  represent- 
ed by  a Markovian  State  diagram  [Ref.  10]  as  shown  in  Figure  2.4.  The 
states  for  this  system  are  the  number  of  customers  in  the  system.  The 


stochastic  differential  equations  which  describe  the  dynamics  of  these 
states  are 


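$$\frac{\partial P_1(t)}{\partial t} = \lambda_T P_0(t) - (\lambda_T + \mu_1) P_1(t) + \mu_2 P_2(t)$$

$$\frac{\partial P_2(t)}{\partial t} = \lambda_T P_1(t) - (\lambda_T + \mu_2) P_2(t) + \mu_3 P_3(t)$$

$$\frac{\partial P_n(t)}{\partial t} = \lambda_T P_{n-1}(t) - (\lambda_T + \mu_3) P_n(t) + \mu_3 P_{n+1}(t), \qquad n = 3, 4, 5, \ldots$$

where $P_n(t)$ = P[system in state n at time t] and $\mu_1 = \ell R_1$;
$\mu_2 = \ell R_1 + \ell R_2$; $\mu_3 = \ell R_1 + \ell R_2 + \ell R_3$.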


Since a steady state result is desired, the above equations are
solved with $\partial P_n(t)/\partial t = 0$ for all n, in order to obtain recursive
relationships between the steady state occupancy probabilities $P_n$. Using
these recursive relationships it is possible to write all $P_n$ as a
function of $P_0$ and one can then apply the requirement that

$$\sum_{n=0}^{\infty} P_n = 1$$

in order to solve for $P_0$. Once one has solved for all $P_n$ in this manner,
the expected time to pass through the queue (W) can be found by first
calculating the expected number of jobs (L) in the system


[Figure 2.5 Upper Bounds on System Performance for a Three Computer System. Expected job time (1 unit = $1/\ell R$): upper bound assuming no parallel processing (M/M/3) and upper bound assuming parallel processing (M/M/1).]

the upper bound on system performance will be taken to be the
performance of the multiserver queue model.
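The steady state procedure described earlier in this section for computers of unequal capacity (solve the recursion for the occupancy probabilities, normalize, compute L, then obtain W) can be sketched as follows. This is a minimal illustration, not the derivation used in the study: the function name and argument layout are hypothetical, the entries of `rates` stand for the products $\ell R_i$, and the last step assumes Little's formula W = L/λ.

```python
def unequal_capacity_job_time(lam, rates):
    """Expected job time for the multiserver bounding model with unequal computers.

    lam   -- total Poisson arrival rate of jobs (lambda_T)
    rates -- service rates of the individual computers (the products l*R_i)
    """
    rates = sorted(rates, reverse=True)      # fastest available computer is used first
    n_comp = len(rates)

    # mu[n-1] is the total service rate when n jobs are present (n <= N)
    mu, total = [], 0.0
    for r in rates:
        total += r
        mu.append(total)
    if not 0.0 < lam < mu[-1]:
        raise ValueError("need 0 < lam < sum(rates) for a steady state")

    # Unnormalized occupancy probabilities q_n for n = 0..N from the recursion
    # q_n = q_{n-1} * lam / mu_n (time derivatives set to zero).
    q = [1.0]
    for n in range(1, n_comp + 1):
        q.append(q[-1] * lam / mu[n - 1])

    # For n >= N the probabilities form a geometric tail with ratio rho.
    rho = lam / mu[-1]
    norm = sum(q[:n_comp]) + q[n_comp] / (1.0 - rho)
    mean_jobs = sum(n * q[n] for n in range(n_comp))
    mean_jobs += q[n_comp] * (n_comp / (1.0 - rho) + rho / (1.0 - rho) ** 2)
    mean_jobs /= norm                         # L, expected number of jobs

    return mean_jobs / lam                    # W = L / lam (Little's formula)
```

With a single computer the function reduces to the M/M/1 result 1/(ℓR − λ), and with equal rates it reproduces the equal capacity multiserver expression.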

It  is  of  interest  to  examine  this  upper  bound  on  performance  as 
a function  of  system  size.  Figure  2.6  illustrates  the  upper  bound  for 
various  size  computer  systems  consisting  of  equal  capacity  computers. 

The bound is plotted as a function of system utilization factor,
$\rho = \lambda_T/N\ell R$. Also plotted is the best lower bound for all of the
systems. The best lower bound, plotted as a function of system utilization
factor, does not vary with system size.

In Figure 2.6 it can be seen that the upper bound improves with system
size. The amount of improvement is greatest in going from small systems
to medium size systems and decreases as a function of system size.
As an example, if one considers going from a system of two computers
operating at a utilization factor of 0.7 to a system of ten computers
operating at a utilization factor of 0.7, the gain in the bound on
expected job time is approximately $0.8/\ell R$. In going from a system of ten
computers to an infinitely large system, also operating at a utilization
factor of 0.7, the gain in the bound on expected job time is less than
$0.1/\ell R$. This suggests that, unless the system is to be operated at an
extremely high utilization factor, there may be little to be gained in
terms of expected job time by increasing the size of the system beyond
about ten to twenty computers. An important point to remember, however,
is that the multiserver bounding model assumes a Poisson input stream of
jobs. This means that the Poisson input streams of the individual
computers which are combined into the input stream of the upper bound
model must be independent of each other. If this does not hold in the
system under consideration, the upper bound on system performance may
not be as good as predicted by the multiserver queue model. In this
study the required independence is assumed to exist.
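The dependence of the upper bound on system size can be checked numerically from the equal capacity multiserver expression. The following is a minimal sketch under stated assumptions: the function name is hypothetical, `service_rate` stands for $\ell R$, and the infinitely large system is taken to have expected job time $1/\ell R$ (no waiting).

```python
from math import factorial


def mmn_expected_time(n_servers, rho, service_rate=1.0):
    """Expected time to pass through an M/M/N queue of equal capacity servers.

    rho is the system utilization factor lam / (N * service_rate).
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("utilization factor must satisfy 0 <= rho < 1")
    a = rho * n_servers                       # offered load lam / service_rate
    # Term for the states in which all N servers are busy.
    tail = a ** n_servers / (factorial(n_servers) * (1.0 - rho))
    p0 = 1.0 / (sum(a ** n / factorial(n) for n in range(n_servers)) + tail)
    # Expected waiting time, algebraically equal to
    # P0 * a**N * rho / (N! * (1 - rho)**2 * lam).
    wait = p0 * tail / ((1.0 - rho) * n_servers * service_rate)
    return wait + 1.0 / service_rate


# Gains in the bound at a utilization factor of 0.7, in units of 1/(lR).
gain_two_to_ten = mmn_expected_time(2, 0.7) - mmn_expected_time(10, 0.7)
gain_ten_to_infinity = mmn_expected_time(10, 0.7) - 1.0
```

Evaluated this way, the gain from two to ten computers is several times larger than the gain from ten computers to an infinitely large system, which is the behavior read from Figure 2.6.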

With the upper and lower bounds on system performance that have
been developed, one can identify the benefits in terms of expected job
time that can possibly be provided by load sharing in a
computer-communication network. There are two general regions of improvement as
depicted in Figure 2.7. The first region of improvement, region A, is
the region between the lower bound on performance for an unbalanced load
system and the lower bound for a balanced load system. Given an initially
unbalanced load, operation in region A can be achieved by simply sending
a fraction of the jobs arriving at the overloaded computers to the
underloaded computers in the system. Which jobs to send can be determined
by random sampling. This technique for load sharing will be called
statistical load sharing. It is analyzed in detail in Chapter 3.

The second region of load sharing operation, region B, is the region
between the best lower bound and the upper bound. Starting with a load
balanced system of computers, it is necessary in order to achieve
operation in this region that job assignment to computers be on the basis
of the system state, i.e. which computers are available at the time of

CHAPTER III STATISTICAL LOAD SHARING

3.1 Network of Queues Model

In Chapter 2 it was shown that for a system of independently
operating computers, system performance in terms of expected job time
improves as the total system load is more evenly distributed among
the computers. Therefore, if a system of computers is operating in an
unbalanced load situation, there is the possibility of improving system
performance by simply sending a fraction of the jobs that arrive
at overloaded computers to underloaded computers, by random sampling,
in order to balance the load. This technique of load sharing in a
computer-communication network, called statistical load sharing, is
examined in this chapter.

A typical example of statistical load sharing operation is shown
in Figure 3.1. In this case, Computer 1 is loaded more heavily than
Computers 2 and 3 and therefore a fraction of the jobs which arrive at
Computer 1, $2\beta$, are sent to be processed at the underloaded computers.
In order to evaluate the expected job time in this example, one must be
able to determine the steady state expected time to pass through each
of the computer and communication queues. This can be done by applying
the following result for a network of queues due to Jackson [Ref. 12].

[Figure 3.1 Statistical Load Sharing. In this example it is assumed $\lambda_1 > \lambda_2 = \lambda_3$.]

The result derived by Jackson applies to a network of M queues in
which

1.  Customers from outside the system arrive at each
    queue m as a Poisson stream with mean rate $\lambda_m$.

2.  Once served at queue m, a customer goes (instantaneously)
    to queue k (k = 1,2,3,...M) with probability $\theta_{km}$.
    With probability $1 - \sum_{k=1}^{M} \theta_{km}$ the customer
    leaves the system.

3.  Customers arriving at queue m (from inside or
    outside the system) are served in a first-come-
    first-served manner. The service time at each
    queue is distributed as a negative exponential
    with mean $1/\mu_m$.

The network of computer and communication queues meets all of these
requirements†† as discussed in Appendix B.

For a network of queues as described above, let $\Gamma_m$ (m = 1,2,...M)
be the average arrival rate of customers at stage m from inside and
outside the system. Then in steady state, the following relations must
hold:

$$\Gamma_m = \lambda_m + \sum_{k=1}^{M} \theta_{mk}\,\Gamma_k \qquad (m = 1,2,3,\ldots M)$$


Now let $K_m$ be the number of customers waiting and in service at
queue m. The state of the system can then be defined as the vector
$(K_1, K_2, \ldots, K_M)$ and the following theorem due to Jackson holds.

† Each queue can be a multiserver queue. In this analysis, however, it
is assumed that each queue is a single server.

†† If computer jobs must pass through more than one communication channel
in succession, the independence assumption for communication service times
discussed in Chapter 2 must be used.

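THEOREM. Define $p_K^{(m)}$ (m = 1,2,...M; K = 0,1,2,...)
as the steady state probability of there being K
customers in an M/M/1 queue with mean input rate $\Gamma_m$
and mean service time $1/\mu_m$, i.e.

$$p_K^{(m)} = \left(1 - \frac{\Gamma_m}{\mu_m}\right)\left(\frac{\Gamma_m}{\mu_m}\right)^{K}$$

Then the steady state distribution of the state of the above defined
system is given by the products

$$P(K_1, K_2, \ldots, K_M) = p_{K_1}^{(1)}\, p_{K_2}^{(2)} \cdots p_{K_M}^{(M)}$$

provided $\Gamma_m < \mu_m$ for m = 1,2,...M.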

This theorem states that in steady state, the system behaves as
if the queues in the network were independent with input rates $\Gamma_m$.
This result allows one to analyze the network of queues model for
statistical load sharing by merely determining the mean input rates to
each of the computer and communication queues.
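The use of Jackson's result can be sketched in two steps: solve the steady state relations for the mean input rates, then multiply the independent M/M/1 occupancy probabilities. The following is a minimal illustration only. The function names and the list-of-lists routing layout are assumptions, and the input rates are found by simple fixed-point iteration rather than by any method named in the text.

```python
def traffic_rates(external, routing, sweeps=10_000, tol=1e-12):
    """Solve Gamma_m = lambda_m + sum_k routing[m][k] * Gamma_k.

    external[m]   -- Poisson arrival rate from outside the system at queue m
    routing[m][k] -- probability that a customer leaving queue k goes to queue m
    """
    gamma = list(external)
    size = len(gamma)
    for _ in range(sweeps):
        new = [external[m] + sum(routing[m][k] * gamma[k] for k in range(size))
               for m in range(size)]
        if max(abs(a - b) for a, b in zip(new, gamma)) < tol:
            return new
        gamma = new
    raise RuntimeError("traffic equations did not converge")


def state_probability(state, gamma, service):
    """Product form probability of the state vector (K_1, ..., K_M)."""
    prob = 1.0
    for k, rate_in, mu in zip(state, gamma, service):
        rho = rate_in / mu
        if rho >= 1.0:
            raise ValueError("every queue needs Gamma_m < mu_m")
        prob *= (1.0 - rho) * rho ** k        # M/M/1 occupancy probability
    return prob


# Two queues: half of the customers leaving queue 0 go on to queue 1.
gamma = traffic_rates([1.0, 0.0], [[0.0, 0.0], [0.5, 0.0]])
p_empty = state_probability((0, 0), gamma, [2.0, 2.0])
```

Once the input rates are known, the expected time to pass through queue m is the M/M/1 value $1/(\mu_m - \Gamma_m)$, which is how the expressions in the following sections are assembled.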

In the next section, this approach is used to examine statistical
load sharing in a fully connected symmetrical communication network
with simple load imbalances. This is followed by a general formulation
of the problem that applies to arbitrary communication topologies and
load imbalances.

3.2 Analysis of Fully Connected Networks with Simple Load Imbalances

In this section, statistical load sharing in a fully connected
symmetrical computer-communication network with one computer overloaded
and all others equally underloaded will be studied. Analysis of this
special case allows one to gain insight into the basic operating
characteristics of statistical load sharing. The approach used in
this section is to start by considering a three computer system with
a given load imbalance and analyzing it in detail. The operating
characteristics observed will then be analyzed as a function of system
load imbalance and as a function of system size.

A Three  Computer  Example 

As a first example, consider a system of three computers in which
Computer 1 is loaded twice as heavily as each of the other two computers
($\lambda_1:\lambda_2:\lambda_3 = 2:1:1$). The system is assumed to be symmetrical, i.e. all
mean computer service times are equal as are all communication channel
capacities. This results in basically the same situation as shown in
Figure 3.1. In order to achieve statistical load sharing, some jobs
arriving at Computer 1 are sent to Computers 2 and 3 for processing.
Jobs are sent to Computer 2 with probability $\beta$ and also to Computer 3
with probability $\beta$. With probability $1 - 2\beta$, jobs arriving at Computer 1
are processed there.

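Using this load sharing strategy, the expression for system
expected job time is

$$E[T] = \sum_{i=1}^{N} \frac{\lambda_i}{\lambda_T} E[T_i] = \frac{\lambda_1}{\lambda_T}E[T_1] + \frac{\lambda_2}{\lambda_T}E[T_2] + \frac{\lambda_3}{\lambda_T}E[T_3]$$

$$\begin{aligned}
E[T] = {} & \frac{\lambda_1}{\lambda_T}\Big[(1 - 2\beta)\,\frac{1}{\ell R - (1 - 2\beta)\lambda_1} \\
& \quad + \beta\Big(\frac{1}{\mu_p C - \beta\lambda_1} + \frac{1}{\ell R - (\lambda_2 + \beta\lambda_1)} + \frac{1}{\mu_r C - \beta\lambda_1}\Big) \\
& \quad + \beta\Big(\frac{1}{\mu_p C - \beta\lambda_1} + \frac{1}{\ell R - (\lambda_3 + \beta\lambda_1)} + \frac{1}{\mu_r C - \beta\lambda_1}\Big)\Big] \\
& + \frac{\lambda_2}{\lambda_T}\Big[\frac{1}{\ell R - (\lambda_2 + \beta\lambda_1)}\Big] + \frac{\lambda_3}{\lambda_T}\Big[\frac{1}{\ell R - (\lambda_3 + \beta\lambda_1)}\Big]
\end{aligned} \tag{3.1}$$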


The expressions for the $E[T_i]$ are obtained by determining through
which computer and communication queues a job must pass. Since in
steady state each of these queues behaves as if it were independent,
the formula for the expected time to pass through an M/M/1 queue can
be applied directly.

In this example, the mean lengths of programs and results
will be assumed to be equal ($1/\mu_p = 1/\mu_r = 1/\mu$). By also noting
that $\lambda_2$ equals $\lambda_3$, Equation 3.1 can be simplified to

$$E[T] = \frac{\lambda_1}{\lambda_T}\left[\frac{1 - 2\beta}{\ell R - (1 - 2\beta)\lambda_1} + 2\beta\left(\frac{2}{\mu C - \beta\lambda_1} + \frac{1}{\ell R - (\lambda_2 + \beta\lambda_1)}\right)\right] + \frac{2\lambda_2}{\lambda_T}\left[\frac{1}{\ell R - (\lambda_2 + \beta\lambda_1)}\right] \tag{3.2}$$


For a given load level in the system, the $\beta$ used for load sharing
is the $\beta$ which minimizes Equation 3.2. Figure 3.2 shows a graph of
this value of $\beta$ for several different mean communication channel service
times. A graph of the associated values of system expected job time is
shown in Figure 3.3. While the graph of expected job time shows the
performance gains due to statistical load sharing, greater insight into
the load sharing operation can be gained by examining the graph of
$\beta$ vs. system load first.
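The choice of the sending probability for a given load can be illustrated by evaluating Equation 3.2 over a grid of values and keeping the minimizer. This is a minimal sketch, not the procedure used to produce Figures 3.2 and 3.3: the function names, the grid search and its step size are all assumptions made for illustration.

```python
import math


def expected_job_time(beta, lam1, lam2, lR, muC):
    """Equation 3.2: three computers, lam2 == lam3, equal program and result lengths."""
    lam_t = lam1 + 2.0 * lam2
    home = lR - (1.0 - 2.0 * beta) * lam1     # spare capacity at Computer 1
    away = lR - (lam2 + beta * lam1)          # spare capacity at Computers 2 and 3
    chan = muC - beta * lam1                  # spare capacity of each channel used
    if home <= 0.0 or away <= 0.0 or chan <= 0.0:
        return math.inf                       # some queue is saturated
    at_one = (1.0 - 2.0 * beta) / home + 2.0 * beta * (2.0 / chan + 1.0 / away)
    return (lam1 / lam_t) * at_one + (2.0 * lam2 / lam_t) / away


def best_beta(lam1, lam2, lR, muC, steps=5000):
    """Grid search over 0 <= beta <= 1/2 for the value minimizing Equation 3.2."""
    grid = [0.5 * k / steps for k in range(steps + 1)]
    return min(grid, key=lambda b: expected_job_time(b, lam1, lam2, lR, muC))


# Heavy load with a very fast communication network (imbalance 2:1:1).
beta_fast = best_beta(1.0, 0.5, 1.0, 1e6)
# Light load with mean communication time equal to mean computation time.
beta_light = best_beta(0.2, 0.1, 1.0, 1.0)
```

If every grid point is infeasible the search simply returns zero, so a caller would want to check that the returned expected job time is finite. In the two cases above the search returns a value close to 1/6 for the fast network and zero for the lightly loaded system.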

The graph of $\beta$ vs. system load shows three basic characteristics
of statistical load sharing operation. These are that 1) there is a
threshold of load sharing operation, 2) using communication systems with
sufficient capacity, there is an asymptote which the probability of
sending a job approaches and 3) for inadequate communication systems,
this asymptote is not reached. Each of these characteristics will now be
examined separately.


[Figure: Probability of Sending a Job in a Three Computer System]

The Characteristic of Threshold

The threshold of load sharing operation occurs when the expected
job time of a system of independently operating computers is equal to
the expected job time of a system using load sharing in the limit as
$\beta \to 0$. In this example, the expression for the expected job time of
the system of independently operating computers is

$$E[T] = \sum_{i=1}^{N} \frac{\lambda_i}{\lambda_T} E[T_i] = \frac{\lambda_1}{\lambda_T}\,\frac{1}{\ell R - \lambda_1} + \frac{\lambda_2}{\lambda_T}\,\frac{1}{\ell R - \lambda_2} + \frac{\lambda_3}{\lambda_T}\,\frac{1}{\ell R - \lambda_3} \tag{3.3}$$


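The threshold condition for load sharing can be found by equating
Equation 3.3 with the limit of Equation 3.2 as $\beta \to 0$. Rearranging
terms and using the fact that $\lambda_2 = \lambda_3$ and that
$1/\mu_p C = 1/\mu_r C = 1/\mu C$ gives

$$\lim_{\beta \to 0} \frac{1}{2\beta\lambda_1}\left\{\lambda_1\left[\frac{1}{\ell R - \lambda_1} - \frac{1 - 2\beta}{\ell R - (1 - 2\beta)\lambda_1}\right] + 2\lambda_2\left[\frac{1}{\ell R - \lambda_2} - \frac{1}{\ell R - (\lambda_2 + \beta\lambda_1)}\right]\right\} = \lim_{\beta \to 0}\left[\frac{2}{\mu C - \beta\lambda_1} + \frac{1}{\ell R - (\lambda_2 + \beta\lambda_1)}\right] \tag{3.4}$$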


Equation 3.6 states that the threshold of statistical load sharing
operation occurs when the incremental decrease in expected job time at
Computer 1 due to load sharing is equal to the incremental increase in
expected job time at Computers 2 and 3, due to the jobs sent there from
Computer 1, plus the expected time to process a job by sending it to
another computer. The expected time to process at another computer is
the expected communication time under no load conditions ($2/\mu C$) plus
the expected job time at the other computer, $1/(\ell R - \lambda_2)$. The weighting
of the derivative terms is due to the form of the system expected value
expression.

Note that the important characteristic of the communication system
is not just the channel capacity, but the mean service time which is
a function of both the channel capacity and the mean message length.
For this reason, the cases studied are examples of various ratios of
mean communication channel service time ($1/\mu C$) to mean computer service
time ($1/\ell R$).


The Characteristic of an Asymptote for the
Probability ($\beta$) of Sending a Job

A second characteristic of statistical load sharing operation is
that the optimum probability ($\beta$) of sending a job from the overloaded
computer to each of the equally underloaded computers by random sampling
sometimes approaches an asymptote. The asymptote is the value of $\beta$
which would distribute the load evenly in the system, since this is
the condition that gives the best expected job time. This can be seen
in the example under consideration by examining Equation 3.1. The
asymptote is approached when the terms representing communication delay
in Equation 3.1 are small with respect to the terms representing
computation time. If this is the case

$$E[T] \approx \frac{(1 - 2\beta)\lambda_1}{\lambda_T}\,\frac{1}{\ell R - (1 - 2\beta)\lambda_1} + \frac{\lambda_2 + \beta\lambda_1}{\lambda_T}\,\frac{1}{\ell R - (\lambda_2 + \beta\lambda_1)} + \frac{\lambda_3 + \beta\lambda_1}{\lambda_T}\,\frac{1}{\ell R - (\lambda_3 + \beta\lambda_1)} \tag{3.7}$$


Equation 3.7 is exactly the expression for the expected job time
of a system of independently operating computers with mean input rates
$(1 - 2\beta)\lambda_1$, $\lambda_2 + \beta\lambda_1$ and $\lambda_3 + \beta\lambda_1$. Therefore, the best expected job
time is obtained by choosing $\beta$ so that the loads will be equal at all
three computers. For this example this means

$$(1 - 2\beta)\lambda_1 = \lambda_2 + \beta\lambda_1 \quad \text{and} \quad \lambda_1 = 2\lambda_2$$

which gives $\beta$ = 1/6 = .167 as shown in Figure 3.2.
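The load balancing condition can be written for one overloaded computer and any number of equally underloaded computers. The sketch below is an illustration under stated assumptions: it takes the strategy described in this section (a fraction $\beta$ of the jobs arriving at Computer 1 is sent to each other computer), writes the overload as $\lambda_1 = r\lambda_2$, and solves $(1 - (N-1)\beta)\lambda_1 = \lambda_2 + \beta\lambda_1$, which gives $\beta = (r - 1)/(N r)$. The function name and the general-N form are not taken from the text.

```python
from fractions import Fraction


def balancing_beta(overload_ratio, n_computers):
    """Value of beta that equalizes the loads at all computers.

    overload_ratio -- r = lam1 / lam2, load of Computer 1 relative to each other computer
    n_computers    -- N, total number of computers in the system
    """
    r = Fraction(overload_ratio)
    # From (1 - (N - 1) * beta) * r = 1 + beta * r
    return (r - 1) / (n_computers * r)


beta_three = balancing_beta(2, 3)   # three computers, imbalance 2:1:1
```

For the three computer system with a 2:1:1 imbalance this reproduces the asymptote of 1/6 found above.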
Figure 3.2 also shows that for some communication systems the
asymptote is not approached. This occurs when the expression for
system expected job time with the optimum $\beta$ has a pole at $\lambda_T < N\ell R$.
In physical terms this means that at the overloaded computer, the
computer queue and all the associated communication queues used for
load sharing saturate at a system load $\lambda_T$ less than $N\ell R$. For the three
computer system considered here, this occurs when a communication
network with $1/\mu C = 5/\ell R$ is used. In this case the overloaded computer,
Computer 1, can process $\ell R$ = 1 job per unit time and there are
communication facilities for sending another $2(.2\ell R)$ = 0.4 jobs per unit
time to be processed elsewhere. This means that all facilities at
Computer 1 saturate at $\lambda_1$ = 1.4 or $\lambda_T$ = 2.8 < $N\ell R$ = 3.

Note that if the probability $\beta$ of sending a job approaches the
asymptote, the expression for expected job time using statistical
load sharing does not have a pole at $\lambda_T < N\ell R$, whereas the expression
for the expected job time of a system of independently operating computers
with a load imbalance does have a pole at $\lambda_T < N\ell R$. This results in a
significant gain in expected job time, when $\lambda_T$ is large, by using
statistical load sharing. More importantly, statistical load sharing
allows the system to operate at higher throughput rates with a
load imbalance than is possible without load sharing. This can clearly
be seen in Figure 3.3.

Figure 3.3 shows that the expected job time using statistical load
sharing improves with a decrease in mean communication service time, as
one would expect. It also shows that for the system under consideration,
a mean total communication time ($2/\mu C$) less than the mean computation
time ($1/\ell R$) gives statistical load sharing performance that closely
approaches the performance of a balanced load system.

The Case of Different Mean Lengths for
Computer Input and Output

As discussed in Chapter 2, in some computer systems, the mean
message length of results is an order of magnitude greater than the
mean message length of programs (inputs). If this relationship is
assumed to hold in the three computer system with $\lambda_1:\lambda_2:\lambda_3 = 2:1:1$,
Equation 3.1 is still the expression for the expected job time, but now
$1/\mu_r C = 10/\mu_p C$. Graphs of expected job time and the probability of
sending a job for this case are shown in Figures 3.4 and 3.5 respectively.
These graphs have the same general characteristics as those in the
previous example. The main difference is that the communication delay
incurred in the system is essentially all due to delay in returning the
results to the computer of origin. Note particularly that when the system
saturates at $\lambda_T < N\ell R$, it is the overloaded computer queue and the
communication queues for returning the results that saturate.

It  is  of  interest  to  examine  statistical  load  sharing  operation  as 
a function  of  load  imbalance  and  system  size.  Operation  as  a function 
of  load  imbalance  will  be  examined  first. 


[Figure: Probability of Sending a Job in a Three Computer System]


Consider a three computer system as before in which the load
imbalance is $\lambda_1:\lambda_2:\lambda_3 = 4:1:1$. It is assumed that
$1/\mu_r C = 1/\mu_p C = 1/\mu C$. Graphs of expected job time and the probability ($\beta$) of sending
a job for this case are given in Figures 3.6 and 3.7 respectively.

By comparing Figures 3.3 and 3.6, one can see that usually the greater
the load imbalance in the system, the greater the range of system
load $\lambda_T$ over which statistical load sharing can provide improvement
in expected job time. This can also be seen by comparing Figures 3.2
and 3.7 and noting that the thresholds of load sharing operation are
lower in the more unbalanced system. One exception to the increase in
the useful range of statistical load sharing with increased load
imbalance is when the communication system is such that a pole occurs in
the expression for expected job time at $\lambda_T < N\ell R$. If this is the case
the greater the load imbalance, the smaller the value of $\lambda_T$ at which
the system saturates. This can be seen by examining the case of
$1/\mu C = 5/\ell R$ in Figures 3.2 and 3.7.

[Figure: Probability of Sending a Job in a Three Computer System with $\lambda_1:\lambda_2:\lambda_3 = 4:1:1$]

Operating Characteristics as a
Function of System Size

the probability of sending a job in a five computer system with a load
imbalance of $\lambda_1:\lambda_2:\lambda_3:\lambda_4:\lambda_5 = 2:1:1:1:1$. As in the three computer
system, load sharing is accomplished by sending a fraction ($\beta$) of the
jobs that arrive at Computer 1 to each of the other computers.


Figures 3.10 and 3.11 show graphs of the expected job time and the
probability of sending a job in a ten computer system. In this system,
Computer 1 is also loaded twice as heavily as each of the other computers
in the system. Again load sharing is achieved by sending a fraction ($\beta$)
of the jobs which arrive at Computer 1 to each of the other computers.

The main effect of system size is that as a system grows there
are more computers to share the overload with and so a smaller fraction
of jobs needs to be sent to each underloaded computer. As a result, the
communication channels in a large system are less loaded, reducing the
delay per job through them and improving system performance in terms of
expected job time. In the examples considered here, the expected job
time in the large system is further reduced by the fact that a smaller
fraction of the jobs submitted to the system are submitted at the
overloaded computer. As a result, even relatively slow communication
networks, such as the case $1/\mu C = 5/\ell R$, provide performance that is quite
good in a ten computer system.†

† It is important to note that the size of a connected communication
network as considered here increases as $N^2$ where N is the number of
computers (nodes) in the network.

[Figure 3.10 Expected Job Time in a Ten Computer System]

The increase in size of the system also improves the operation of
statistical load sharing with slow communication networks by moving the
system pole to $\lambda_T = N\ell R$. This can be seen for the case $1/\mu C = 5/\ell R$.
In the three computer case with a load imbalance of $\lambda_1:\lambda_2:\lambda_3 = 2:1:1$,
such a communication network gave a system pole at $\lambda_T = 2.8 < N\ell R$.
In the five computer case with the same load imbalance such a communication
network ($1/\mu C = 5/\ell R$) does not give a system pole at $\lambda_T < N\ell R$. This is
because as stated before, with more computers with which to load share,
each communication channel in the fully connected network carries a
smaller amount of traffic.
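The effect of system size on the location of the pole can be checked with the same saturation argument used for the three computer system: Computer 1 can process $\ell R$ jobs per unit time itself and can send at most $\mu C$ jobs per unit time over the channel to each of the other N - 1 computers. The following is a minimal sketch under that assumption. The function name is hypothetical and the load imbalance is expressed as the ratio of the load at Computer 1 to the load at each other computer.

```python
def saturation_load(n_computers, overload_ratio, lR, muC):
    """System load lam_T at which Computer 1 and its outgoing channels saturate."""
    # Largest arrival rate Computer 1 can dispose of, locally or by sending.
    lam1_max = lR + (n_computers - 1) * muC
    # Fraction of the total load that arrives at Computer 1.
    share = overload_ratio / (overload_ratio + n_computers - 1)
    return lam1_max / share


# Communication network with 1/(mu C) = 5/(l R), i.e. mu C = 0.2 when l R = 1.
three = saturation_load(3, 2, 1.0, 0.2)   # about 2.8, below N * lR = 3
five = saturation_load(5, 2, 1.0, 0.2)    # about 5.4, above N * lR = 5
```

A value below $N\ell R$ indicates a pole at $\lambda_T < N\ell R$. In the five computer case the computed value exceeds $N\ell R$, so the computers themselves limit the system load, consistent with the observation above.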

Summary  of  Operating  Characteristics 
of  Example  Networks 

Together,  the  previous  examples  have  served  to  show  some  of  the 
basic  operating  characteristics  of  statistical  load  sharing  in  a fully 
connected  symmetrical  computer-communication  network  with  simple  load 
imbalances.  In  summary  these  characteristics  are 


1.  Statistical load sharing can provide significant
    improvement in expected job time performance for
    large system loads by correcting load imbalances.
    Most importantly, the system can operate at higher
    throughput levels than are possible without load
    sharing.

2.  There exists a threshold of load sharing in the
    system which is a function of the mean communication
    time and the system load imbalance.

3.  There exists an asymptote that the optimum probability
    of sending a job approaches if the capacity of the
    communication network is such that the expression
    for system expected job time does not have a pole at
    $\lambda_T < N\ell R$. This asymptote is a function of system
    size and load imbalance.

4.  The asymptote will not be approached if the capacity
    of the communication network is such that the
    expression for system expected job time has a pole
    at $\lambda_T < N\ell R$.

While  these  summary  points  apply  to  the  specific  examples  considered, 
similar  characteristics  will  be  found  in  the  general  case  of  statistical 
load  sharing  which  will  be  formulated  in  the  next  section. 

3.3  Formulation  of  the  General  Statistical  Load  Sharing  Problem 

In  this  section,  the  general  statistical  load  sharing  problem  is 
shown  to  be  a nonlinear  multicommodity  flow  problem  which  can  be  solved 
by  an  efficient  optimization  technique.  The  examples  of  statistical 
load  sharing  given  in  the  previous  section  were  restricted  to  fully 
connected  symmetrical  computer-communication  networks  with  simple  load 
imbalances. The general formulation given here allows for arbitrary
communication networks, arbitrary load imbalances and different rates
of service ($R_i$) at each of the computers.

It is easiest to understand the equivalence between the statistical
load sharing problem and a multicommodity flow problem by considering an
example. Figure 3.12a shows a three computer system with a partially
connected communication network. The logical flows of jobs that can
occur in the system are shown in Figure 3.12b. Each computer is thought
of as having two nodes, an input node and an output node, connected by
a directed arc representing computer service. Jobs arrive at the input
node of a computer at rate $\lambda_i$. There is a requirement that the jobs
arriving at input node i receive computer service somewhere in the
system and leave the system at output node i'. The jobs can be processed
either at the computer to which they are submitted, or the programs can
be sent to another computer input node. The possible flows of computer
programs are represented by the flows $f_{pj}$ in Figure 3.12b. The
subscript j identifies the communication channel over which the actual
flows would occur. If a job is processed at a computer other than the
one at which it was submitted, the results must be sent back to the
computer of origin. The possible flows of results are the flows $f_{rj}$,
the subscript j again referring to the communication channel over which
the actual flow occurs.

[Figure 3.12 General Formulation of Statistical Load Sharing. a) Three computer system with a partially connected communication network. b) Logical flow of jobs in the network.]

In  order  to  assure  that  jobs  submitted  at  input  node  i leave  the 
system  at  output  node  i’,  all  jobs  are  identified  as  to  their  origin. 
This  means  that  the  return  route  of  a computer  job  is  fixed  once  it  has 
been  processed.  In  the  network  of  queues  model,  this  fixed  routing 
based  on  the  job  origin  is  equivalent  to  the  random  sampling  used  to 
determine  the  flows  in  the  various  routes  through  the  network.  As 
discussed  in  Appendix  B,  this  is  because  the  identification  of  jobs 
in  a Poisson  stream  consisting  of  two  Poisson  substreams  with  different 
origins  is  equivalent  to  random  sampling  of  the  combined  job  stream. 

In terms of the logical flows in the network, the statistical
load sharing problem is now a multicommodity flow problem. The flows
required are for computer jobs to flow from node i to node i' at rate
$\lambda_i$. Note that in the logical flow network, this requirement can be
met only by having each computer job pass through a computer once and
only once, as required by the load sharing problem.

The multicommodity flow problem is to determine the optimal flow
through a network subject to a well defined convex objective function
and a set of convex constraints. In the load sharing problem the
objective function is expected job time. Assuming that the message
lengths for programs and results are equal ($\mu_p = \mu_r = \mu$)†, the
expression for the expected time to pass through the logical flow network
is

$$E[T] = \sum_{\text{all arcs}} E[T \text{ through arc } i]\;\Pr[\text{pass through arc } i]$$

$$E[T] = \frac{1}{\lambda_T}\left[\sum_{i=1}^{N} \frac{f_i}{\ell R_i - f_i} + \sum_{j=1}^{NC} \frac{f_{pj} + f_{rj}}{\mu C_j - (f_{pj} + f_{rj})}\right] \tag{3.8}$$


where N = number of computers

NC = number of communication channels

$\lambda_T = \sum_{i=1}^{N} \lambda_i$ = total input rate of jobs to the system

$f_i$ = flow of jobs per unit time through computer i

$f_{pj}$ = flow of jobs (programs) per unit time through
communication channel j

$f_{rj}$ = flow of jobs (results) per unit time through
communication channel j.

† The formulation for $\mu_p \neq \mu_r$ is given by Equation 3.10.

The  expressions  for  the  expected  times  to  pass  throuqh  each  of  the  arcs 
in  the  loqical  flow  network  are  the  expected  times  to  pass  throuqh  M/M/1 
queues  with  the  appropriate  mean  service  times  and  input  rates. 

There are two types of constraints that go with the expected job
time objective function. There are flow requirements which specify a
flow of $\lambda_i$ jobs from node i to i' and there are capacity
constraints which require that $f_i < \ell R_i$ (i = 1, 2, ..., N) and
that $f_{pj} + f_{rj} < \mu C_j$ (j = 1, 2, ..., NC). The capacity
constraints serve to bound the region of feasible flows through the
network with boundaries that represent infinite values of the objective
function. The flow requirements and capacity constraints together define
a convex feasible region which will be denoted by F. This fact, together
with the fact that the objective function is convex because it is the sum
of convex functions, allows one to apply the multicommodity flow
algorithm developed by Cantor and Gerla [Ref. 4] directly to the problem
at hand. This algorithm solves for the optimum flow of jobs (f*) through
the logical flow network which minimizes E[T] subject to $f \in F$. The
optimal flow solution determines the sets $f_i^*$ (i = 1, 2, ..., N) and
$f_{pj}^*$ and $f_{rj}^*$ (j = 1, 2, ..., NC) which, together with the
inputs $\lambda_i$ (i = 1, 2, ..., N), determine the optimal statistical
load sharing policy and the system expected job time.


Carrying out the solution of the optimization problem described
above requires a computer implementation of the Cantor and Gerla
algorithm. A particularly efficient implementation of this algorithm
has been developed by Defenderfer [Ref. 6]. The optimization algorithm
is based on the efficient generation of extremal flows of the region F
and the subsequent optimization over the region described by these
extremal flows. The algorithm does this as follows. One starts with an
initial set of extremal flows $(\phi_1, \phi_2, \ldots, \phi_q)$ called a
basis. Using this basis, a flow f is found which is a convex combination
of the basis elements, i.e.

$$f = \sum_{i=1}^{q} \alpha_i \phi_i \quad \text{where } \alpha_i \ge 0 \text{ and } \sum_{i=1}^{q} \alpha_i = 1.$$

In particular, one finds the f that minimizes
$E[T]\left(\sum_{i=1}^{q} \alpha_i \phi_i\right)$ subject to
$\sum_{i=1}^{q} \alpha_i = 1$ and $\alpha_i \ge 0$. This flow will be
denoted f*, called the current estimate of the solution to the problem
minimize E[T](f) subject to $f \in F$. This flow may only be an estimate
because not every element of F can be expressed in terms of the current
basis elements.
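The restricted minimization over a basis can be illustrated with a toy case of only two basis flows. The sketch below is not the Cantor and Gerla algorithm itself: it uses a plain grid search over the single weight, and the arc layout, capacities, and flow values are hypothetical numbers chosen for illustration. Because $\lambda_T$ is constant, minimizing the summed arc delay is equivalent to minimizing E[T].

```python
def total_delay(flow, capacity):
    """Sum of f / (c - f) over all arcs; infinite if any arc is saturated."""
    total = 0.0
    for f, c in zip(flow, capacity):
        if f >= c:
            return float("inf")
        total += f / (c - f)
    return total


def best_combination(phi_1, phi_2, capacity, steps=1000):
    """Grid search for the weight alpha in [0, 1] that minimizes the delay
    of the convex combination alpha * phi_1 + (1 - alpha) * phi_2."""
    best_alpha, best_cost = None, float("inf")
    for k in range(steps + 1):
        alpha = k / steps
        flow = [alpha * a + (1 - alpha) * b for a, b in zip(phi_1, phi_2)]
        cost = total_delay(flow, capacity)
        if cost < best_cost:
            best_alpha, best_cost = alpha, cost
    return best_alpha, best_cost


# Hypothetical example: arcs are [computer 1, computer 2, channel].
# All 1.2 jobs per unit time arrive at computer 1.
capacity = [1.0, 1.0, 10.0]
phi_local = [1.2, 0.0, 0.0]   # every job processed where it arrived
phi_remote = [0.0, 1.2, 2.4]  # every job sent away (program + result traffic)
alpha, cost = best_combination(phi_local, phi_remote, capacity)
```

Neither basis flow is feasible on its own here, since each overloads one computer, yet their convex combinations include feasible flows. The best weight keeps slightly more than half of the jobs local because sending a job also incurs channel delay.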

One now tests if f* is the solution to the main problem minimize
E[T](f) s.t. $f \in F$ by checking if there exists an $f \in F$ such that

$$\text{Cost}(f) = \langle \nabla E[T](f^*), \, (f - f^*) \rangle < 0$$

where $\langle \cdot , \cdot \rangle$ is the usual inner product in
Euclidean space and $\nabla$ denotes the gradient. If no such f exists,
f* solves the main problem. This


In the previous section, the characteristics of threshold and
asymptotic load sharing policies were shown to exist in specific
symmetric examples. These characteristics are found in general
statistical load sharing problems as well. Figures 3.13 and 3.14 show two
load sharing operating points for a ten computer network. In this
network, the odd numbered computers have a capacity such that
$\ell R = 0.5$, while the even numbered computers have a capacity such
that $\ell R = 2.0$. All communication links are full duplex and the
channel capacity is such that $\mu C = 2.0$. In both figures the load
imbalance is the same, i.e. computer 2 is the overloaded computer. The
difference between the two cases is that in Figure 3.13 the total network
load is greater. In fact, in Figure 3.13 the network is near saturation
and one can clearly see the asymptotic nature of the solution. The load
is fairly evenly distributed among computers of equal capacity even
though they are at varying distances from the overloaded computer.

Figure 3.14 shows the network in a lightly loaded situation in which
not all of the load sharing flows have reached threshold. Note that
this means that there is a "radius" over which load sharing occurs. In
large networks, one may want to confine the load sharing to small regions
and this example shows that in some cases this may yield an optimal
solution.


Figure 3.13 A Ten Computer Load Sharing Example
($\lambda_T$ = 7 jobs/unit time)

Figure 3.14 A Ten Computer Load Sharing Example
($\lambda_T$ = 4.2 jobs/unit time)

3.4 A Heuristic Load Sharing Algorithm For Ring Networks

It has been shown that the characteristics of threshold and
asymptotic load sharing policies are generally found in statistical
load sharing problems. This section uses these characteristics to
develop a simple load sharing algorithm for symmetric ring networks
(symmetric in the sense that all computers have equal capacity).

The basic idea behind the algorithm is to first test if neighboring
computers are above or below the threshold of load sharing. If they
are above threshold, the load they are servicing is distributed evenly
between them. This is done since even division of load is the
asymptotic load sharing policy for equal capacity computers.

Figure 3.15 gives a flow chart for the algorithm. The list of all
possible load sharing pairs of computers enables one to keep track of
the load sharing decisions already made as the algorithm proceeds. In
this way one can avoid inconsistencies such as load sharing both ways
between two computers. The equation used to determine if two computers
are above or below threshold is a straightforward modification of
Equation 3.6. Clearly, if $\lambda_i > \ell R$ load sharing action must
be taken if possible.

As an example of the use of the algorithm, consider the network
shown in Figure 3.16. In this network, computers 1 and 2 are overloaded
and computers 3 and 4 are underloaded. Figure 3.17 shows the system
expected job time for the example when 1) no load sharing is used, 2)
the heuristic algorithm is used, and 3) the Cantor and Gerla algorithm
is used. It can be seen that while not optimal, the heuristic algorithm
works well. This is particularly significant since the algorithm is so
simple. While the algorithm is simple for symmetric ring

Figure 3.15 Flow Chart of a Heuristic Load Sharing Algorithm for Ring Networks

Figure 3.16 A Four Computer Load Sharing Example

Figure 3.17 Expected Job Time For Four Computer Example Shown

in this system, the jobs submitted at the down computer are all sent to
be processed elsewhere. If the system load was balanced before the
failure, the best policy is to distribute the jobs from the down computer
evenly among the computers which are still functioning. Figure 3.18
shows the performance of the ten computer example when this is the case.

It can be seen that statistical load sharing allows the system to
operate at a reduced capacity and performance when computers fail in the
system. In the example considered here the system can operate up to
$\lambda_T = N_w \ell R$, where $N_w$ is the number of computers working
in the system. This is because the communication facilities at each
computer are sufficient to service all jobs that arrive at down
computers. If this is not the case, the system will saturate at
$\lambda_T < N_w \ell R$. This issue of the saturation of the
communication facilities before the saturation of all computer
facilities is the same as discussed in Section 3.2.

It is important to note that there is a reduction in system
performance as well as system capacity when computers fail. This is
because of the communication delay incurred by jobs submitted at down
computers and because the computers still operating are now more heavily
loaded. If expected job time is critical for the system under
consideration, the system may be considered inoperable even when there
is still enough computer capacity to process all jobs submitted to the
system.


Figure 3.18 Expected Job Time in a Ten Computer System
with Computer Failure

The Effects of Communication Channel Failure
The performance of a load sharing system can be degraded by
communication channel failure, as well as by computer failure. If the
channel which fails carries a significant amount of load sharing traffic,
this failure can substantially increase the system expected job time.
As an example of this, consider the network shown in Figure 3.16 and
assume that the communication channels between computers 1 and 4 fail.
The system can continue to operate with this failure, but the system
expected job time is increased as shown in Figure 3.19.

In communication networks that are not perfectly reliable, there is
a finite probability that not every computer node will have access via
communications to every other computer node. This issue related to the
reliability of the communication subsystem must be taken into
consideration when analyzing the failure characteristics of
computer-communication networks. A good survey of the literature on the
topic of reliability of the communication subsystem is given by Wilkov
[Ref. 35].


Figure 3.19 Expected Job Time in a
Communication Failure. The network

CHAPTER IV DYNAMIC LOAD SHARING


4.1  Dynamic  Load  Sharing  Using  a High  Capacity  Communication  Network 


In Chapter 2 it was shown that there is a region of load sharing
operation, called the dynamic load sharing region, where one achieves
improvements in expected job time beyond those attainable by balancing
average loads such as was done with statistical load sharing. In order
to achieve such gains it is necessary to use a load sharing technique
that assigns jobs to computers on the basis of which computer is the
most desirable to use at the time of assignment. This section
investigates one such technique which can achieve dynamic gains,
starting with a load balanced system, when the communication network
being used is a fully connected network of high capacity.

Description  of  the  Dynamic  Load  Sharing  Technique 
The dynamic load sharing technique considered here operates with
a global controller that uses an instantaneous communication network
for control purposes which is separate from the communication network
used for load sharing. The controller assigns all incoming jobs to
computers on a global first-come-first-served basis. If a job arrives
at a time when the computer to which it was submitted is busy, it is
immediately assigned to the first available computer according to a
preference list. If all computers are busy, the job is queued at the
computer at which it was submitted and is assigned to the first computer
that becomes available.

When a job is assigned to a computer other than the one at which
it was submitted, the computer to which it is assigned is reserved for
it during the time the program is sent to that computer as well as
during the time the results are being returned to the computer of origin.
This assures that once a job assignment is made, the job will find the
computer available when it arrives to be processed.
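The assignment rule described above can be sketched as a small function. This is only an illustration of the rule as stated, not the controller of the original system; the function name, the representation of busy (or reserved) computers as a set, and the preference lists are assumptions made for the example.

```python
def assign_job(origin, busy, preference):
    """Pick the computer that should process a job submitted at `origin`.

    origin     -- computer at which the job was submitted
    busy       -- set of computers that are currently busy or reserved
    preference -- preference[origin] is the ordered list of alternates
    Returns the chosen computer, or None when every computer is busy and
    the job must be queued at its computer of origin.
    """
    # A job is processed where it was submitted if that computer is free.
    if origin not in busy:
        return origin
    # Otherwise take the first available computer on the preference list.
    for candidate in preference[origin]:
        if candidate not in busy:
            return candidate
    # All computers are busy: the job waits for the first one to free up.
    return None


# Hypothetical three computer system with cyclic preference lists.
preference = {1: [2, 3], 2: [3, 1], 3: [1, 2]}
```

In a fuller simulation the chosen computer would be added to `busy` as soon as the assignment is made, which models the reservation of the computer while the program is in transit.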


The performance of this dynamic load sharing technique will now be
analyzed by first considering a simple approximation to its performance.
This approximation will then be improved by an approximation developed
by Larson [Ref. 23] of the hypercube queueing model which describes
queueing systems in which servers and customers are identified by spatial
locations.


A First Approximation To Dynamic Load Sharing Performance

The dynamic load sharing technique under consideration operates in
a manner that is very similar to a multiserver queue. The difference is
that in the load sharing system, both arriving jobs and computers are
distinguishable as to their spatial location. This means that there is
a preferred computer for each job that arrives. It also means that some
jobs must undergo a communication delay while others do not.

One way to develop a first approximation of the performance of this
dynamic load sharing technique is to consider the performance of a
particular multiserver queue that must clearly perform worse than the
load sharing system. Such a multiserver is shown in Figure 4.1a. In this
multiserver job assignments are made dynamically as in the load sharing
system, but it is assumed that every job must undergo a communication
delay, whether or not it is processed by the computer at which it was
submitted. Clearly, the performance of this multiserver must be worse
than that of the actual load sharing system. In order to analyze this
multiserver bounding model, one must first determine the service time
distributions for each of the computer and communication stages in the
queue. The computation time is of course distributed as a negative
exponential with mean $1/\ell R$ since once a job is assigned to a
computer, it is assured that the computer will be available when the job
arrives to be processed. The time to pass through a communication
channel, however, involves both waiting time and transmission time.
Because the system under consideration has a fully connected
communication network, there is only one situation in which queueing
occurs in a communication channel. This situation is depicted in
Figure 4.1b. A job submitted to Computer 1 was assigned to Computer 2
because Computer 1 was busy at the time of assignment. Before Computer 2
finishes processing this job, Computer 1 becomes available and a job
arrives at Computer 2 which is assigned to Computer 1. This means that a
computer program is being transmitted to Computer 1. Meanwhile,
Computer 2 finishes the job it was processing and wants to send the
results back to Computer 1, but the communication channel is busy. The
situation could also have occurred in the reverse order, the results
being sent before the program and the program therefore having to wait
to use the communication channel. This contention between one program
message and one result message is the only queueing in the communication
channel that occurs when a fully connected communication network is used
with the dynamic load sharing technique presented here. (The queueing
referred to here is queueing after a job has been assigned to a
computer.) In consequence, one way to assure that the performance of the
first approximation model is indeed worse than that of the actual system
is to assume that every message must wait for one other message. The
distribution of the time to pass through a communication channel is then
a second order Erlang distribution of mean $2/\mu C$ (the convolution of
two negative exponentials of mean $1/\mu C$). This is shown in
Figure 4.1a.

Figure 4.1 First Approximation Model for Dynamic Load Sharing

The first approximation model for dynamic load sharing is now a
well defined multiserver queue with a general service time distribution
(M/G/N queue). The general service time probability density function
of this queue is the convolution of the density functions of each of
the computer and communication stages. Because analytic results do
not exist for an M/G/N queue, it is necessary to approximate its
performance by the performance of a multiserver queue for which results
are available. The multiserver queue which will be used is a queue
with Poisson input and a second order Erlang service time distribution
($M/E_2/N$ queue, discussed in Appendix A). Using an $M/E_2/N$ queue
with the same mean service time as the M/G/N queue it is approximating,
it is now possible to estimate the performance of the dynamic load
sharing technique under consideration. Figure 4.2 shows a graph of this
estimate for two different mean communication channel times in a three
computer system. Also shown are the upper bound for dynamic load sharing
and the performance of a load balanced system of independent computers.
The region between these two curves represents operation that is
achieving dynamic load sharing gains.

It can be seen that both the computer-communication networks
considered achieve dynamic load sharing gains for some values of total
system load $\lambda_T$. The network with a mean communication channel
time $1/\mu C = 1/(100\,\ell R)$ does so over a wide range of
$\lambda_T$ and it closely approaches the upper bound, as one would
expect for a very high capacity communication network. The network with
$1/\mu C = 1/(10\,\ell R)$, however, achieves only very small dynamic
gains over a very small range of $\lambda_T$. The reason for this is
that the first approximation model assumes that every job must undergo a
communication delay that involves queueing and transmission time for
both programs and results. As a result, for a network with
$1/\mu C = 1/(10\,\ell R)$, the mean service time for the corresponding
approximation model is $1/\ell R + 4/\mu C = 1.4/\ell R$. This gives an
expected job time of $1.4/\ell R$ at $\lambda_T = 0$ and a system pole
at $\lambda_T = N \ell R / 1.4$, which means that only minimal dynamic
load sharing benefits are obtained.
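The arithmetic behind the first approximation model can be checked with a few lines of code. This is a sketch under the assumptions stated in the text (every job incurs an Erlang-2 channel delay of mean $2/\mu C$ twice, once for its program and once for its results); the function and argument names are illustrative, with `comp_rate` standing for $\ell R$ and `chan_rate` for $\mu C$.

```python
def first_approximation(n_computers, comp_rate, chan_rate):
    """Mean service time and saturation load of the bounding model in which
    every job pays a channel delay for both its program and its results."""
    mean_channel_time = 2.0 / chan_rate  # Erlang-2 with mean 2/(uC)
    mean_service_time = 1.0 / comp_rate + 2.0 * mean_channel_time
    saturation_load = n_computers / mean_service_time
    return mean_service_time, saturation_load


# Three computers with lR = 1 and uC = 10 * lR, as in the example above.
service_time, pole = first_approximation(3, 1.0, 10.0)
```

With these inputs the mean service time is 1.4 times $1/\ell R$ and the pole falls at $N \ell R / 1.4$, matching the values quoted in the text.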

As stated above, the reason that the first approximation model
for a network with $1/\mu C = 1/(10\,\ell R)$ does not achieve
significant dynamic load sharing gains is that it is assumed that every
job incurs a communication delay. In actual system operation this is
obviously not true, since some jobs are processed by the computer at
which they were submitted. The first approximation of dynamic load
sharing performance will now be refined by using an approximation to the
hypercube queueing model to determine the probability of a job being
processed at a computer other than the one at which it was submitted.
The probability of not sending a job elsewhere to be processed,
$1 - P\{\text{send}\}$, will then be used to modify the first
approximation model as shown in Figure 4.3. The probability of not
sending a job has been included as an impulse at the origin of the
density function of the communication service time, representing the
fact that if a job is not sent, it incurs no communication delay.

Determining  the  Probability  of  Sending  a Job  by  an 
Approximation  to  the  Hypercube  Queueing  Model 


The hypercube queueing model is a multiserver queue which has
identifiable servers and customers. It has been used mainly to analyze
the operation of urban emergency service systems in which police cars or
other emergency vehicles are the servers and calls for assistance
represent customers. Both the servers and customers are identified as
to their location. This is analogous to the load sharing system in
which jobs and computers are identified as to their location.

The following approximation to the hypercube model is based on the
work of Larson [Ref. 23]. It has been modified slightly to fit the
load sharing problem.†

In the computer-communication network model, one is interested in
determining the probability that a job submitted at Computer i
(i = 1, 2, ..., N) is processed by Computer j (j = 1, 2, ..., N). Job
assignments are made according to the rule that a job is processed by
the computer at which it was submitted, if that computer is not busy. If
the computer of origin is busy, the job is assigned to the first
available computer according to a preference list and if all computers
are busy, the job is queued and assigned to the first computer that
becomes available.

The probability of sending a job from Computer i to Computer j
in a system like this is the probability that in a random sampling
without replacement one would find Computer i busy and Computer j free.
In general, in assigning a job, one samples computers in order of
preference without replacement until a free one is found. For a system
in which each computer is equally loaded,‡ the probability that a job
is sent to the jth preference computer and that the job was not queued
is

$$P\{B_1 B_2 \cdots B_{j-1} F_j\}$$

where

$B_j$ = event that the jth server selected at random is busy.

$F_j = B_j^c$ = event that the jth server selected at random is free.

By using conditional probabilities, one can write

$$P\{B_1 B_2 \cdots B_j F_{j+1}\} = \sum_{K=0}^{N-1} P\{B_1 B_2 \cdots B_j F_{j+1} \mid S_K\} \, P\{S_K\} \qquad j = 1, 2, \ldots, N-1 \qquad (4.1)$$

† The hypercube model assumes that travel time to a customer (equivalent
to communication time) is part of the customer service time (equivalent
to computation time). The modification that has been made here is that
communication time and computation time are considered to be separate.
This is accomplished by using the additional communication stages in
the multiserver model which makes the system an M/G/N queue. The standard
hypercube approximation model is an M/M/N queue.

‡ For a system in which each server is equally loaded, the probabilities
$P\{B_1 B_2 \cdots B_{j-1} F_j\}$ can be taken directly as probabilities
of sending a job because while the events $B_j$ and $F_j$ refer to
servers chosen at random, each server appears to be the same if each is
equally loaded. Therefore the probability of finding a particular
combination of computers busy and free is the same as finding a random
combination busy and free. This is not the case if the servers are
unequally loaded. The case of unequal loads is treated by Larson
[Ref. 23] and Jarvis [Ref. 14] for emergency service systems in which
travel time (equivalent to communication time) is not considered
separate from customer service time (equivalent to computation time).
The analysis presented here does not extend directly to the case of
unequal loads.


The conditional probabilities on the right hand side of Equation 4.1
are now easily found. For example $P\{B_1 \mid S_K\}$ is the probability
that the first server selected at random will be busy, given that K
servers in the N server system are busy. Clearly,

$$P\{B_1 \mid S_K\} = K/N$$

Similarly, given that the first selected server is busy and that a total
of K servers are busy,

$$P\{B_2 \mid B_1 S_K\} = \frac{K - 1}{N - 1}$$

In general,

$$P\{B_i \mid B_1 B_2 \cdots B_{i-1} S_K\} = \begin{cases} \dfrac{K - (i - 1)}{N - (i - 1)} & i = 1, 2, \ldots, K + 1 \\ 0 & \text{if } i > K + 1 \end{cases}$$

Similarly,

$$P\{F_{j+1} \mid B_1 B_2 \cdots B_j S_K\} = \begin{cases} \dfrac{N - K}{N - j} & j = 0, 1, \ldots, K \\ 0 & \text{if } j > K \end{cases}$$

Therefore,

$$P\{B_1 B_2 \cdots B_j F_{j+1}\} = \sum_{K=j}^{N-1} \left[ \frac{K}{N} \cdot \frac{K - 1}{N - 1} \cdots \frac{K - (j - 1)}{N - (j - 1)} \cdot \frac{N - K}{N - j} \right] P\{S_K\} \qquad j = 1, 2, \ldots, N-1$$
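The product of conditional probabilities above translates directly into code. The sketch below is illustrative only: the function name is an assumption, and the state probabilities $P\{S_K\}$ used in the example are made-up numbers, since the text leaves open how they are obtained for a particular system.

```python
def prob_busy_then_free(j, n_servers, state_probs):
    """P{B_1 ... B_j F_(j+1)}: the first j servers sampled without
    replacement are busy and the next one is free.

    state_probs[k] is P{S_K} for K = 0 .. n_servers - 1 busy servers.
    Calling with j = 0 gives the probability that the first server
    sampled is free while fewer than n_servers servers are busy.
    """
    total = 0.0
    for k in range(j, n_servers):
        term = state_probs[k]
        for i in range(j):  # draw number i + 1 finds a busy server
            term *= (k - i) / (n_servers - i)
        term *= (n_servers - k) / (n_servers - j)  # next draw is free
        total += term
    return total


# Made-up state probabilities for N = 3; the missing 0.1 is P{all busy}.
state_probs = [0.4, 0.3, 0.2]
probs = [prob_busy_then_free(j, 3, state_probs) for j in range(3)]
```

Summing the results over j = 0, ..., N-1 recovers the probability that fewer than N servers are busy (0.9 in this example), because given fewer than N busy servers the sampling always ends at some free server.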


Assuming that one can solve for the probabilities $P\{S_K\}$ in the
system under consideration, one can easily determine the probabilities
$P\{B_1 B_2 \cdots B_j F_{j+1}\}$ which give the probability of sending a
job to the (j+1)th preference computer and that the job was not queued
before assignment. If the job was queued before assignment then all
computers were busy when it arrived and it is assigned to the first
computer which becomes available. The probability of a job being
assigned to any particular computer in this case is 1/N in steady state.
The total probability of a job being sent to be processed at a computer
other than the computer of origin is then

$$P\{\text{send}\} = P\{\text{send} \mid \text{no queueing delay}\} \, P\{\text{no queueing delay}\} + P\{\text{send} \mid \text{queueing delay}\} \, P\{\text{queueing delay}\}$$

$$= \sum_{j=2}^{N} P\{\text{send to jth preference computer and no queueing delay}\} + \left[ \frac{N-1}{N} \right] P\{S_{K \ge N}\}$$

$$= \sum_{j=2}^{N} P\{B_1 B_2 \cdots B_{j-1} F_j\} + \left[ \frac{N-1}{N} \right] P\{S_{K \ge N}\} \qquad (4.3)$$

Since the probabilities $P\{B_1 B_2 \cdots B_{j-1} F_j\}$ were derived
from an approximation† to the hypercube queueing model, they must be
normalized by the condition

$$\sum_{j=2}^{N} P\{B_1 B_2 \cdots B_{j-1} F_j\} + P\{F_1 S_{K<N}\} = P\{S_{K<N}\} \qquad (4.4)$$

where

$$P\{F_1 S_{K<N}\} = \sum_{K=0}^{N-1} P\{F_1 \mid S_K\} \, P\{S_K\} = \sum_{K=0}^{N-1} \frac{N - K}{N} \, P\{S_K\}$$

† The exact solution to the hypercube queueing model is obtained by
solving the equations of detailed balance for the continuous time, finite
state Markov process which describes the behavior of the system. For
systems described by M/M/N multiservers with distinguishable servers,
close agreement has been found between the approximation and exact
results [Refs. 23 and 14].

A good technique for accomplishing this normalization is to simply scale
each of the $P\{B_1 B_2 \cdots B_{j-1} F_j\}$ so that Equation 4.4 is
met. When this normalization technique is used one can substitute

$$\sum_{j=2}^{N} P\{B_1 B_2 \cdots B_{j-1} F_j\} = P\{S_{K<N}\} - \sum_{K=0}^{N-1} P\{F_1 \mid S_K\} \, P\{S_K\}$$

into Equation 4.3 giving

$$P\{\text{send}\} = P\{S_{K<N}\} - \sum_{K=0}^{N-1} P\{F_1 \mid S_K\} \, P\{S_K\} + \left[ \frac{N-1}{N} \right] P\{S_{K \ge N}\} \qquad (4.5)$$

The probability of sending a job can now be easily determined. As
stated previously, one can then include the probability of not sending
a job, $1 - P\{\text{send}\}$, as an impulse at the origin of the density
function for the service time of the communication stages as shown in
Figure 4.3. This improves the approximation of dynamic load sharing
performance by reducing the expected communication time. The improved
approximation will now be used to examine several examples.
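Equation 4.5 can be evaluated with a short function once the state probabilities are known. The sketch below is an illustration under the equal load assumption, using $P\{F_1 \mid S_K\} = (N - K)/N$; the function name and the sample state probabilities are hypothetical.

```python
def prob_send(n_servers, state_probs):
    """Evaluate Equation 4.5 (illustrative sketch).

    state_probs[k] is P{S_K} for K = 0 .. n_servers - 1 busy servers; the
    remaining probability mass is P{S_(K>=N)}, i.e. all computers busy.
    """
    p_not_all_busy = sum(state_probs[:n_servers])
    p_all_busy = 1.0 - p_not_all_busy
    # Probability that the computer of origin is free on arrival.
    p_origin_free = sum((n_servers - k) / n_servers * state_probs[k]
                        for k in range(n_servers))
    return (p_not_all_busy - p_origin_free
            + (n_servers - 1) / n_servers * p_all_busy)


# Made-up state probabilities for N = 3 (0.1 left over for "all busy").
example = prob_send(3, [0.4, 0.3, 0.2])
```

Two limiting cases agree with the behavior described in the text: when the system is always empty the result is zero, and when all computers are always busy the result is (N - 1)/N.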

Examples  of  Dynamic  Load  Sharing  Performance 
The following are examples of the previously described dynamic load
sharing technique used in computer-communication networks where the
average rates of inputs at all computers are the same. The preference
list for dynamic assignments is such that this balance is maintained.
The networks considered are all fully connected communication networks
so that the previous discussion about queueing in communication channels
applies.

Consider first a three computer system in which the mean communication
time for one channel is $1/\mu C = 1/(10\,\ell R)$. A graph of the
probability of sending a job vs. system load for this case is shown in
Figure 4.4. Note that near $\lambda_T = 0$ the probability of sending a
job is zero because the computer of origin is always available when a job
arrives. The system also has a pole, at which point
$P\{\text{send}\} = (N - 1)/N$. This is because, at system saturation, a
job is queued whenever it arrives and when this occurs the probability
of sending a job is (N - 1)/N. The system pole occurs when the system
utilization factor, $\rho$, equals 1. At this point the mean service
time is $1/\ell R + P\{\text{send}\}(0.4/\ell R)$. This means that
saturation occurs when
$\lambda_T = N \ell R / [1 + 0.4\,P\{\text{send}\}]$.

Figure 4.4 Probability of Sending a Job in a Three Computer System

A graph of expected job time vs. system load for this three computer
case is shown in Figure 4.5. It can be seen that use of the hypercube
approximation to determine the probability of sending a job puts the
performance curve for a system with $1/\mu C = 1/(10\,\ell R)$ well
within the dynamic load sharing region. As explained above, the system
still has a pole at $\lambda_T < N \ell R$, but for load levels less than
$\lambda_T = 2.1$, dynamic gains are clearly indicated.

Examples of five and ten computer systems will now be considered
in order to show how our estimates of dynamic load sharing gains vary
as a function of system size. Figures 4.6 and 4.7 show the performance
curves for five and ten computer systems respectively. As before,
$1/\mu C = 1/(10\,\ell R)$. These examples show that as system size
increases, better dynamic performance is attained at load levels below
the system pole. This is because as the size of the system increases,
the probability that a job will be queued before being assigned to a
computer decreases.

Figures 4.6 and 4.7 also show a slight shift of the system pole as
the number of computers in the system changes. The system pole for
dynamic load sharing occurs when

$$\lambda_T \text{ at saturation} = \frac{N}{[\text{mean service time}]} = \frac{N}{\dfrac{1}{\ell R} + \dfrac{4 P\{\text{send}\} \text{ at saturation}}{\mu C}} = \frac{N}{\dfrac{1}{\ell R} + \dfrac{4(N - 1)}{N \mu C}}$$

Therefore

$$\lambda_T \text{ at saturation} = \frac{N \ell R}{1 + \dfrac{4(N - 1)}{N} \dfrac{\ell R}{\mu C}} \qquad (4.6)$$

Since the term (N - 1)/N varies slowly as a function of system size,
the system pole location relative to $N \ell R$ changes slightly as
system size changes.

Equation 4.6 can be used to examine the system pole location as a
function of mean communication service time. The location of the system
pole is determined primarily by the ratio $\mu R/\mu C$. For high capacity
communication systems, this ratio is small and the system pole is
therefore close to $\lambda_T = N\mu R$. As communication capacity
decreases, the ratio $\mu R/\mu C$ increases and the system pole moves
toward the origin. For this reason, the dynamic load sharing technique
described here is effective only with high capacity communication networks.
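The dependence of the pole on system size and on the ratio $\mu R/\mu C$ can be checked numerically. The following is a minimal sketch, assuming the saturation relation given above with $P\{\text{send}\} = (N - 1)/N$; the function name `saturation_load` and the parameter names `mu_r` and `mu_c` (standing for $\mu R$ and $\mu C$) are illustrative and do not appear in the study.

```python
def saturation_load(n, mu_r, mu_c, comm_factor=4.0):
    """Total load lambda_T at which the utilization factor reaches 1.

    Assumes P{send} = (N - 1)/N at saturation and a mean service time of
    1/mu_r + comm_factor * P{send} / mu_c, as stated in the text.
    """
    p_send = (n - 1) / n
    mean_service_time = 1.0 / mu_r + comm_factor * p_send / mu_c
    return n / mean_service_time


if __name__ == "__main__":
    # Pole location relative to N * mu_r for the three system sizes discussed.
    for n in (3, 5, 10):
        pole = saturation_load(n, mu_r=1.0, mu_c=10.0)
        print(n, round(pole, 3), round(pole / n, 3))
    # Lower communication capacity moves the pole toward the origin.
    for mu_c in (100.0, 10.0, 1.0):
        print(mu_c, round(saturation_load(3, mu_r=1.0, mu_c=mu_c), 3))
```

With $\mu R = 1$ and $\mu C = 10$ the three computer pole falls above the load level of 2.1 quoted earlier and below $N\mu R = 3$, which agrees with the discussion of Figure 4.5.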

The reason that the dynamic load sharing technique presented here
requires a high capacity communication network is that if a job is sent
to be processed at a computer other than the one at which it was
submitted, the computer used to process the job is reserved for it during
the communication time required to send the job. This is done so that
the job is assured of finding the computer available when it arrives to
be processed. This works well for a high capacity communication
network, but for a low capacity network it means that computers are
reserved for large amounts of time during which they provide no service.
This moves the system pole towards the origin as communication capacity is
decreased (communication delay is increased). Therefore, a dynamic
load sharing technique used with a low capacity communication network
must eliminate the reservation of computers during communication time,
as is discussed in the next section.


4.2 Dynamic Load Sharing Using a Low Capacity Communication Network

As shown in the previous section, when using a dynamic load sharing
technique in a low capacity communication network, one cannot afford to
reserve computers during the time jobs are communicated to them. This
means that when a job is sent to be processed at another computer in a
low capacity system, it may incur a queueing delay at the processing
computer in addition to the significant communication delay it incurs.
One must therefore consider the policy of queueing a job at the computer
of origin when that computer is busy, even if there is another computer
that is not busy in the network.

If one considers queueing jobs at busy computers while others are
not  busy,  one  is  faced  with  the  question  of  how  many  jobs  to  queue 
before  it  is  better  to  send  a job  elsewhere  to  be  processed.  Since  the 
hypercube  model  does  not  allow  this  type  of  operation,  one  must  seek 
other  techniques  of  analysis.  Conceptually,  one  way  in  which  to  analyze 
a system  which  allows  this  type  of  operation  is  to  model  each  of  the 
computers  and  communication  channels  as  queues  operating  in  discrete 
time  with  finite  length  buffers.  The  system  operation  is  then  described 
by  a discrete  time,  finite  state  Markov  process  which  could  be  analyzed 
for  various  dynamic  load  sharing  policies.  At  present,  however,  it  is 
not  feasible  to  use  this  analysis  because  of  the  extremely  large  state 
space  required  by  the  problem.  The  analysis  of  general  dynamic  load 
sharing  techniques  therefore  remains  an  area  open  for  further  research. 
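The growth of the state space can be illustrated with a simple count. The sketch below assumes, purely for illustration, that the system state is the vector of queue lengths, one entry per computer and per communication channel, each ranging from 0 to the buffer length; the study does not specify the state description, and the function name and the example sizes are hypothetical.

```python
def state_count(num_computers, num_channels, buffer_length):
    """Number of states when each queue holds 0..buffer_length jobs.

    Assumption: the state is the vector of all queue lengths, so the
    count is (buffer_length + 1) ** (number of queues).
    """
    num_queues = num_computers + num_channels
    return (buffer_length + 1) ** num_queues


if __name__ == "__main__":
    # Hypothetical example: 3 computers, 6 channels, buffers of length 10.
    print(state_count(3, 6, 10))
```

Even this small hypothetical system has more than two billion states under that assumption, and a transition matrix over such a state space is far larger still.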


CHAPTER  V CONCLUSION  AND  SUGGESTIONS 
FOR  FURTHER  RESEARCH 

5.1  Conclusion 

This study of load sharing in a computer-communication network has
shown that load sharing can provide improvements in the expected time
to process jobs in a distributed computer system. Upper and lower
bounds for this performance criterion were developed and two techniques
for load sharing were investigated using queueing models. Specifically,
it has been shown that statistical load sharing can be used to improve
expected job time by correcting load imbalances. Most importantly, the
correction of these load imbalances allows the system to operate at
higher throughput levels than is possible without load sharing. To obtain
improvements in expected job time beyond those possible with a simple
technique such as statistical load sharing, it is necessary to use a
dynamic load sharing technique. One such technique was investigated
and shown to give significant dynamic gains if used with a high capacity
communication network. It was also shown that load sharing capabilities
can improve system reliability by making the system fail soft, at the
expense of degraded performance.

Computer-communication networks today are increasing the capabilities
of computer systems by providing the means for remote access to
time-shared computer facilities, data base sharing and the sharing of unique
computer resources. Because of the definite improvements in system
expected job time and reliability that load sharing can provide, provisions
for load sharing should be given serious consideration in the design of
future computer-communication networks. A study of the performance
curves for statistical load sharing and dynamic load sharing shows that
it is most important to balance out severe load imbalances. The
performance improvement that is gained by simply balancing the average
load is far greater than the additional improvement gained by doing
dynamic job assignment. This indicates that in actual implementations of
load sharing, it may be sufficient to make load sharing policy decisions on
a periodic basis to balance average loads, rather than to make a decision
based on system state for every job. An important result of this study
is the identification of the load sharing problem as a multicommodity
flow problem. This means that as progress is made in solving the problem
of dynamic control of other multicommodity flow problems, such as
message routing in a packet switched communications network, the results
can be applied to the load sharing problem.

5.2  Suggestions  for  Further  Research 

The upper and lower bounds on system performance that were
developed in Chapter 2 provide a frame of reference within which to
evaluate load sharing techniques. It would be of interest to examine
load sharing techniques other than those presented here, such as the
dynamic technique of Roome and Torng [Ref. 31], within this frame of
reference.
In order to achieve a better understanding of the benefits of load
sharing, it would be of interest to obtain a more complete statistical
description of performance than just expected job time. Higher moments
of job time distributions, or even better, complete job time distributions
would be of value.

Since  a computer-communication  network  is  a dynamic  system,  it  is 
important  to  understand  its  transient  operation  as  well  as  its  steady 
state  operation.  A transient  situation  that  is  of  particular  interest 
is  the  system  response  to  a temporary  overload  at  one  of  the  computers. 

As suggested in Chapter 4, another idea for further study is to
model each of the computers and communication channels as queues
operating in discrete time with finite length buffers and to use a
discrete time, finite state Markov process analysis to study general
dynamic load sharing techniques. In order to use this approach, one
must first find ways to handle the problem of the extremely large state
space generated by this model.

Another suggestion for further research is to consider reliability
improvements using load sharing techniques in systems where the
communication channels are subject to degradation rather than total failure.
An example of such degradation is a change in signal to noise ratio in
a radio channel.

A final suggestion for further research is to investigate dynamic
load sharing operation using control schemes that do not assume a global
controller using an instantaneous communication system separate from
the communication network used for load sharing. It is quite likely
that in actual implementations of load sharing, there would not be a
separate global controller and that control information would be sent
via the same communication network as the computer programs and results.
APPENDIX A QUEUEING FORMULAS


This appendix gives the basic queueing formulas used in this
study. Derivations of the formulas in Sections A.1 and A.2 can be
found in standard references on queueing theory such as Cohen [Ref. 5]
(M/M/1 queue only), Gross and Harris [Ref. 8], Hillier and Lieberman
[Ref. 10] or Saaty [Ref. 33].

A.1 The M/M/1 Queue

The M/M/1 queue is a single server queue with a Poisson arrival
process with mean arrival rate $\lambda$ and a negative exponential service
time distribution with mean service time $1/\mu$. The queue has a steady
state solution only when the utilization factor, $\rho = \lambda/\mu$, is
less than 1. If this is the case, the steady state distribution of the
number of customers in the system is given by

$$P_K = (1 - \rho)\rho^K, \qquad K = 0, 1, 2, \ldots$$

where $\rho = \lambda/\mu < 1$.

Using this steady state distribution, it can be shown that the
distribution of the time to pass through the M/M/1 queue is also
exponential with mean

$$E[T] = \frac{1}{\mu - \lambda}$$
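The two M/M/1 formulas can be evaluated directly. The following is a minimal sketch; the function names are illustrative and not taken from the study, and the numerical values are chosen only as an example.

```python
def mm1_state_probability(k, lam, mu):
    """Steady state probability of k customers in an M/M/1 queue."""
    rho = lam / mu
    if rho >= 1:
        raise ValueError("no steady state solution: utilization must be below 1")
    return (1 - rho) * rho ** k


def mm1_expected_time(lam, mu):
    """Expected time to pass through an M/M/1 queue, 1 / (mu - lam)."""
    if lam >= mu:
        raise ValueError("no steady state solution: utilization must be below 1")
    return 1.0 / (mu - lam)


if __name__ == "__main__":
    lam, mu = 0.5, 1.0
    # Expected number in system from the distribution (truncated sum).
    mean_number = sum(k * mm1_state_probability(k, lam, mu) for k in range(500))
    # Little's formula L = lambda * W should hold for these two quantities.
    print(mean_number, lam * mm1_expected_time(lam, mu))
```

The comparison printed at the end is a consistency check of the two formulas through Little's formula, not an additional result.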
A.2 The M/M/N Queue

The M/M/N queue is an N server queue, also with Poisson input
(mean arrival rate $\lambda$) and exponential service time (mean service
time $1/\mu$). The steady state distribution of the number of customers
in the system is given by

$$P_K = \begin{cases} \dfrac{(\lambda/\mu)^K}{K!}\,P_0 & \text{if } 0 \le K < N \\[2ex] \dfrac{(\lambda/\mu)^K}{N!\,N^{K-N}}\,P_0 & \text{if } K \ge N \end{cases}$$

where

$$P_0 = \left[\sum_{K=0}^{N-1} \frac{(\lambda/\mu)^K}{K!} + \frac{(\lambda/\mu)^N}{N!}\,\frac{1}{1 - (\lambda/N\mu)}\right]^{-1}$$

By solving for the expected number of customers in the system and
applying $L = \lambda W$, it follows that the expected time to pass through
the system is given by

$$E[T] = \frac{P_0\,(\lambda/\mu)^N \rho}{\lambda\,N!\,(1 - \rho)^2} + \frac{1}{\mu}$$

where $\rho = \lambda/N\mu$.
A.3 The M/$E_k$/N Queue

The M/$E_k$/N queue is an N server queue with a k th order Erlang
service time distribution. The technique used to analyze this queue
is given by Heffer [Ref. 9]. The distribution of the number of
customers in the system and the resulting value of the expected time
to pass through the system are not given by convenient closed form
solutions. Numerical results, however, are available in Hillier and
Lo [Ref. 11]. These numerical results were used to perform the
calculations in this study.

In order to calculate performance curves for the hypercube
approximation model, the sequence of calculations that one follows is

1. Given the value of the total system utilization factor,
calculate the probability of sending a job.

2. Using the probability of sending a job and the system
utilization factor, calculate the total system load
$\lambda_T$ at which this utilization factor occurs.

3. Calculate the expected job time by using Little's formula
$L = \lambda W$ [Ref. 23] where $L$ is the expected number of customers
in the system at this system utilization factor and $W$ is the
desired job time.
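Steps 2 and 3 of this sequence can be sketched in a few lines. The sketch assumes that the mean service time has the form $1/\mu R + 4P\{\text{send}\}/\mu C$ used in Chapter 4, so that the utilization factor is $\rho = \lambda_T \times (\text{mean service time})/N$. The probability of sending a job (step 1) and the expected number of customers $L$ (taken from the numerical tables) are treated as given inputs. All function names, parameter names and numerical values are hypothetical.

```python
def total_load(rho, p_send, n, mu_r, mu_c):
    """Step 2: total system load lambda_T at which utilization rho occurs.

    Assumes mean service time = 1/mu_r + 4 * p_send / mu_c and
    rho = lambda_T * (mean service time) / N.
    """
    mean_service_time = 1.0 / mu_r + 4.0 * p_send / mu_c
    return rho * n / mean_service_time


def expected_job_time(expected_customers, lam_t):
    """Step 3: Little's formula L = lambda * W solved for W."""
    return expected_customers / lam_t


if __name__ == "__main__":
    # Hypothetical inputs: p_send would come from step 1 and
    # expected_customers from the tabulated numerical results.
    lam_t = total_load(rho=0.6, p_send=0.3, n=3, mu_r=1.0, mu_c=10.0)
    print(lam_t, expected_job_time(expected_customers=2.5, lam_t=lam_t))
```

Repeating these steps over a range of utilization factors traces out one performance curve of expected job time against total system load.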


APPENDIX B PROOF OF THE APPLICABILITY OF THE
NETWORK OF QUEUES MODEL TO STATISTICAL LOAD SHARING


The analysis of the network of queues model for statistical load
sharing was made possible by a theorem by Jackson [Ref. 12] which states
that in steady state, the probability distribution of the state of the
network can be written in a product form. The terms in the product are
the distributions of the state (number of customers) at each queue in
the network considered as a separate independent queue with the
appropriate input rate. The purpose of this appendix is to show that the
statistical load sharing problem meets the requirements of the Jackson
theorem.

The requirements of the Jackson theorem are that

1. Customers from outside the system arrive at each queue
as a Poisson stream.

2. Once served at queue m, a customer's destination is
determined by a random sampling. With probability
$\theta_{km}$ he goes (instantaneously) to queue k (k = 1, 2,
..., M) and with probability

$$1 - \sum_{k=1}^{M} \theta_{km}$$

he leaves the system.

3. Each queue has an exponential service time distribution
and serves all arriving customers (from inside or outside)
in a first-come-first-served manner.†

The third requirement is clearly met by the statistical load sharing
problem, since both communication channels and computers are modeled
as exponential servers. The other two requirements, however, need to
be examined more closely.

† A first-come-first-served discipline is sufficient, but not necessary
[Ref. 22].

The  requirement  that  customers  from  outside  the  system  arrive  as  a 
Poisson  stream  is  met  by  assumption  at  all  computer  queues  which  do  not 
send  jobs  on  to  be  processed  elsewhere.  At  computer  queues  where  some 
of  the  jobs  are  sent  elsewhere,  customers  arrive  as  a random  process 
which  is  a random  sampling  of  a Poisson  input  stream.  This  is  also 
true  of  communication  queues  which  are  used  to  forward  jobs  arriving 
from  outside  the  system  (computer  programs).  Therefore,  in  order  to 
show  that  the  first  requirement  of  the  Jackson  theorem  is  met,  it  is 
necessary  to  show  that  a random  sampling  of  a Poisson  process  yields 
another  Poisson  process.  Such  a proof  is  given  below. 


Proof  that  a Random  Sampling  of  a Poisson 
Process  Yields  Another  Poisson  Process 


A Poisson arrival process with rate parameter $\lambda$ is a renewal process
for which the interarrival times are exponentially distributed with mean
$1/\lambda$. Now consider a random sampling of this process in which arrivals
are counted with probability $\beta$ and not counted with probability
$1 - \beta$. Then the Laplace transform of the density function for the time
$L$ between successive arrivals which are counted is

$$\mathcal{L}\{f_L(\ell)\} = \mathcal{L}\{f_{w_1}(w_1)\}\beta + \mathcal{L}\{f_{w_2}(w_2)\}\beta(1-\beta) + \mathcal{L}\{f_{w_3}(w_3)\}\beta(1-\beta)^2 + \cdots = \sum_{n=1}^{\infty} \mathcal{L}\{f_{w_n}(w_n)\}\,\beta(1-\beta)^{n-1}$$

where $w_n$ is the sum of n interarrival times in the underlying Poisson
process.

Since the transform of the density of a sum of statistically independent
random variables is the product of the transforms of each random variable
in the sum, the expression becomes

$$\mathcal{L}\{f_L(\ell)\} = \sum_{n=1}^{\infty} \left(\frac{\lambda}{s + \lambda}\right)^n \beta(1-\beta)^{n-1} = \frac{\beta\lambda}{s + \beta\lambda}$$

which is the transform of an exponential density. Therefore $L$ is
distributed exponentially with mean $1/\beta\lambda$. Since successive
interarrival times in the sampled process are also statistically
independent, the sampled process is Poisson with rate parameter
$\beta\lambda$. Q.E.D.
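The result of the proof can also be checked by simulation. The following is a minimal sketch, not part of the original analysis: it samples a simulated Poisson stream with probability $\beta$ and compares the counted interarrival times with an exponential distribution of mean $1/\beta\lambda$. The function name, the seed and the parameter values are arbitrary choices made for illustration.

```python
import math
import random


def sampled_interarrival_times(lam, beta, num_arrivals, seed=0):
    """Times between counted arrivals of a randomly sampled Poisson stream."""
    rng = random.Random(seed)
    gaps = []
    elapsed = 0.0
    for _ in range(num_arrivals):
        elapsed += rng.expovariate(lam)   # next arrival of the underlying process
        if rng.random() < beta:           # arrival is counted with probability beta
            gaps.append(elapsed)
            elapsed = 0.0
    return gaps


lam, beta = 2.0, 0.25
gaps = sampled_interarrival_times(lam, beta, num_arrivals=200_000)
mean_gap = sum(gaps) / len(gaps)
expected_mean = 1.0 / (beta * lam)
# For an exponential distribution, P(gap > mean) = exp(-1).
tail_fraction = sum(g > expected_mean for g in gaps) / len(gaps)

if __name__ == "__main__":
    print(mean_gap, expected_mean)
    print(tail_fraction, math.exp(-1))
```

A simulation of this kind only supports the result numerically; the transform argument above remains the proof.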


The  final  requirement  that  the  statistical  load  sharing  problem 
must  meet  is  that,  once  serviced  at  a queue,  the  destination  of  a job 
is  determined  by  random  sampling.  This  results  in  a random  routing 
through  the  network  of  queues.  In  the  statistical  load  sharing  problem, 
the  routing  of  a job  is  random  until  it  has  passed  through  a computer 
queue.  Once  it  has  passed  through  a computer  queue,  it  must  be  returned 
to the computer of origin before it can leave the system. This
deterministic routing in the statistical load sharing problem must therefore
be shown to still meet the requirements of the Jackson theorem.

Consider a computer queue which services both jobs submitted to it
directly (arriving at rate $\lambda_1$) and jobs sent from an overloaded
computer (arriving at rate $\lambda_2$). At the output of this queue, it
must be decided whether a job leaves the system (if it was submitted to
the computer directly) or if it is to be sent over a specific
communication channel (if it came from the overloaded computer). Assume
that the decision is made on the basis of a tag which identifies the
origin of the job. In order for this decision to meet the requirements of
the Jackson theorem, it must produce output streams of customers that
appear as if the decision was made by random sampling.

When a random decision rule is used, the output streams from the
computer queue are both Poisson. This follows from the fact that the
output of an exponential server with Poisson input is Poisson, as has
been shown by Burke [Ref. 3]. This property of exponential servers makes
all job streams in the network of queues Poisson when a random decision
is used. The statistical load sharing problem must therefore also
generate Poisson streams at all points in the network.

As stated before, the routing decision at the output of the computer
under consideration is made on the basis of a tag which identifies the
origin of the job. The sequence of decisions that are made at the
output of the queue is generated by the order in which jobs arrive at the
input of the computer because all jobs are served in a
first-come-first-served manner. In a sequence of routing decisions, the
probability that the next job is of origin 1 is the probability that, in
the input stream, the job which arrived immediately after the job whose
routing has just been determined was of origin 1. This probability is
$\lambda_1/(\lambda_1 + \lambda_2)$, independent of all previous outputs
because jobs arrive at the input of the computer as independent Poisson
streams of rates $\lambda_1$ and $\lambda_2$. The sequence of decisions
made at the output therefore appears to an observer at that point to be a
purely random sequence and the resulting output streams with different
destinations are therefore Poisson as required by the Jackson theorem.
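The claim that tag based routing looks like random sampling can be illustrated by simulation. The sketch below, which is not part of the original argument, merges two independent Poisson streams in order of arrival (the order preserved by first-come-first-served service) and estimates both the fraction of jobs of origin 1 and the same fraction conditioned on the previous job being of origin 1. Function names, the seed and the rates are arbitrary illustrative choices.

```python
import random


def origin_sequence(lam1, lam2, num_jobs, seed=0):
    """Origins (1 or 2) of jobs in arrival order for two merged Poisson streams."""
    rng = random.Random(seed)
    next1 = rng.expovariate(lam1)   # next arrival time of the direct stream
    next2 = rng.expovariate(lam2)   # next arrival time of the forwarded stream
    origins = []
    for _ in range(num_jobs):
        if next1 <= next2:
            origins.append(1)
            next1 += rng.expovariate(lam1)
        else:
            origins.append(2)
            next2 += rng.expovariate(lam2)
    return origins


lam1, lam2 = 1.0, 3.0
origins = origin_sequence(lam1, lam2, num_jobs=200_000)
fraction_origin1 = origins.count(1) / len(origins)
# Fraction of origin 1 jobs among those that immediately follow an origin 1 job.
after_origin1 = [b for a, b in zip(origins, origins[1:]) if a == 1]
conditional_fraction = after_origin1.count(1) / len(after_origin1)

if __name__ == "__main__":
    print(fraction_origin1, conditional_fraction, lam1 / (lam1 + lam2))
```

If the two estimates agree with each other and with $\lambda_1/(\lambda_1 + \lambda_2)$, the tag sequence is at least consistent with independent random sampling at lag one; this is a numerical illustration rather than a proof.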

Another way to show that the statistical load sharing problem meets
the requirements of the Jackson theorem is to apply the idea of a job
routing determined by an Nth order Markov chain as has been done by
Kobayashi and Reiser [Ref. 22].
LIST OF SYMBOLS

$N$: number of computers in the system.

$\lambda_T$: mean total arrival rate of computer jobs in the system.

$\lambda_i$: mean arrival rate of computer jobs at the i th computer.

$1/\mu$: mean number of operations required per computer job.

[symbol illegible]: rate at which the i th computer performs operations.

$1/\mu_p$: mean message length in bits for computer programs.

$1/\mu_r$: mean message length in bits for computer results.

$C_i$: channel capacity in bits per unit time of the i th
communication channel.

$E[T]$: system expected job time.

$E[T_i]$: expected time to process a computer job which enters the
system at the i th computer.

$\rho$: utilization factor of a queue.

$\beta$: probability of sending a job which arrives at an overloaded
computer to a specific underloaded computer using statistical
load sharing.

[symbol illegible]: flow rate of computer jobs through the i th computer.

[symbol illegible]: flow rate of computer programs through the i th
communication channel.

[symbol illegible]: flow rate of computer results through the i th
communication channel.

$NC$: number of communication channels in the network.

$P\{\text{send}\}$: probability of sending a job in a dynamic load sharing
system.

[symbol illegible]: probability of finding the i th computer sampled in a
dynamic load sharing system to be busy.